NOTE ON THE RELIABILITY OF A TEST:
A REPLY TO DR. CRUM’S CRITICISM 


TRUMAN L. KELLEY 
Stanford University, Cal. 


Replying to Dr. Crum’s article,¹ I will discuss the phases dealing
with unequal variability of tests and the use of the Spearman-Brown 
Formula, after first considering certain more general questions. 

Dr. Crum’s argument is lucid and easily leads to his conclusions. 
However, it seems to me that he has overlooked certain important 
principles. Let me refer to the first and second paragraphs on page
300 of his article.² There are several statements in this second
paragraph which are quite contrary to my concept of reliability and which
have far-reaching logical implications. In place of the third, fourth, 
and last sentences in this second paragraph, I would subscribe to the 
following: 

1. The test really consists of a number of samplings of a hidden 
function which has many modes of expression, the various questions 
constituting different modes of approach. 

2. Clearly, in the nature of the case, it is desirable to have as low 
correlation of the individual questions with each other as possible. 

3. Thus the practical problem of estimating reliability is of great 
moment for it gives a measure of the residuum common to many 
questions, which residuum is in general, by definition, the function that
concerns us. 

I do not consider the distinction that Dr. Crum draws between the 
mental test and the accomplishment test a valid one, in its bearing 
upon reliability. I can make my point clear by citing any one of the 
tests in the Stanford Achievement Battery. The tests in this battery 


1 Crum, W. L., Note on the Reliability of a Test, with Special Reference to the
Examinations set by the College Entrance Board. The American Mathematical
Monthly, Vol. XXX, No. 6, Sept.-Oct., 1923.

2 See page 204 of this article. 
are as closely connected with the subject-matter of instruction in the
elementary school as it was possible for the authors (Terman, Ruch, 
and myself) to make them. For instance, we searched elementary 
school arithmetics, courses of study, and other published statements 
to establish the content represented by Arithmetical Computation. 
We then drew up problems, each one devised to measure a separate 
phase of this content, which, you will note, is the opposite of the 
procedure suggested by Dr. Crum. In spite of our procedure as 
described we have a test with high reliability. I say “in spite of” because
high reliability was not an end in itself; the end was to measure
computation. That we obtained, while pursuing this end, a score with
high reliability is gratifying, as it is a guarantee that the thing measured
is well measured. Our endeavor to tap various phases of the subject
resulted in securing a residuum which is entitled to being called 
computation. It is in fact because there is something that is common 
to the various problems that we may use a single term to describe it. 
It is because this thing that is common exists that pupils sometimes 
make generalizations, that transfer of training is possible, and that an 
examiner is able to predict future accomplishments from a knowledge 
of past attainment. If there are features, specific to the separate 
problems, thus having no relation to anything else, they are unimpor- 
tant to the examiner except that they are sometimes in his way. Let 
me hasten to state that in a given computation test there may be 
problems which are possessed of features not found in any other of 
the problems in the same test, which features may still be related 
to the general field of computation. In so far as this is true the test 
will tend to have a reliability for the examiner’s purposes which is 
greater than the $r_{1I}$ given by the Spearman-Brown Formula. This
is a point I have stressed in my text “Statistical Method” and upon 
which I, of course, agree with Dr. Crum. I shall return to it, but here 
I wish to make clear the fact that a feature having no point in common 
with anything else cannot be considered an essential part of any of 
those abilities that appertain to the subject in general, be it computation 
ability, spelling ability, history, chemistry, geometry, algebra, etc. 

Some one may argue that a subject-matter test is not a test of 
ability, but of past accomplishment: that spelling “Przemysl” is a
spelling accomplishment whether it has any bearing upon future
spelling or not. Granting this, it is still a fact that every subject- 
matter test is both an indication of past accomplishment and of future 
promise and it is only because it is this latter, and insofar as it is this
latter, that it has any value in problems of educational guidance and 
classification. Let me at this point enumerate the conclusions I 
would draw from the preceding argument: 

1. If an exercise contains some feature which is absolutely unique,
the fact that we do not measure it in obtaining a reliability coefficient 
is not only no drawback but a decided asset to one who uses the relia- 
bility coefficient in estimating the accuracy of a prediction based on 
the score on the exercise. 

2. If an exercise contains a feature which is unique so far as the 
other exercises in the test are concerned, but not unrelated to the 
general field of the subject, then (a) the Spearman-Brown¹ $r_{1I}$ will tend,
on this account, to be too small, but (b) the feature not being totally
unique it should be possible (though at times difficult) to draw up a 
second exercise measuring the same feature, thus the devising of a 
second similar form of the test is a possibility. 

3. If two or more exercises contain common features, not found in 
the general field, then the Spearman-Brown $r_{1I}$ will tend on this
account to be too large. 

We may state the problem in algebraic terms. Let us first state 
the simplest case, the one that offers no difficulties. Let $X_1$, the
score on a test (first form), be a function of ability, $A$, and an error, $E_1$,
and let us suppose that these two things are entirely independent of
each other. We may then write

$$X_1 = f(A) + \phi(E_1)$$

If the mean ability of the group studied is large and the variability
of the group not great with reference to this mean and if the scoring
scheme permits of many graded amounts we would expect $X_1$ to be
approximately a linear function of $A$ and also a linear function of
the chance factor; so we will write

$$X_1 = C_1 A + C_2 E_1 + C_3$$

in which the $C$’s are constants, the score $X_1$ the dependent variable,
and the chance error $E_1$ and the ability $A$ independent variables,
changing from individual to individual.





1 By Spearman-Brown reliability coefficient I mean an $r_{1I}$ derived by the
formula $r_{1I} = 2r_{12}/(1 + r_{12})$.
Let $\bar{X}_1$ be the mean score for the group, $\bar{A}$ the mean amount of
ability for the group, and $\bar{E}_1$ the mean chance factor. Let us express
the preceding equation from the means of the variables as origins.
Let $x_1 = X_1 - \bar{X}_1$, $a = C_1 A - C_1 \bar{A}$, and $e_1 = C_2 E_1 - C_2 \bar{E}_1$. We
then have, transferring origins to the means,

$$x_1 = a + e_1$$

That is to say, $x_1$, the score as a deviation from the mean, is equal to $a$,
the ability factor as a deviation from the mean, plus $e_1$, the chance
factor as a deviation from the mean. In this equation $a$ and $e_1$ are
entirely independent so that the standard deviation of the scores, $\sigma_1$,
is given by

$$\sigma_1^2 = \frac{\sum x_1^2}{N} = \frac{\sum (a + e_1)^2}{N} = \frac{\sum a^2}{N} + \frac{\sum e_1^2}{N} = \sigma_a^2 + \sigma_{e_1}^2$$

Let us assume that a second similar form of the test is available
and that it is given to the same individuals; then we have

$$x_2 = a + e_2$$

Since the tests are similar and the individuals presumably have the
same ability at the two sittings, $a$ in this equation is the same for each
individual as $a$ in the equation $x_1 = a + e_1$, but the chance factors
vary at the two sittings so that $e_1 \neq e_2$, and therefore $x_1 \neq x_2$.
However, if the tests are similar, the standard deviations of the two chance
factors will in the long run be the same, so that $\sigma_{e_1} = \sigma_{e_2}$, and we will
have

$$\sigma_2^2 = \sigma_a^2 + \sigma_{e_2}^2 = \sigma_a^2 + \sigma_{e_1}^2 = \sigma_1^2$$


For the correlation between the scores on the two similar forms we have

$$r_{12} = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2} = \frac{\sum (a + e_1)(a + e_2)}{N \sigma^2} = \frac{\sum a^2}{N \sigma^2} = \frac{\sigma_a^2}{\sigma^2}$$

In the denominator we have, since $\sigma_1 = \sigma_2$, set $\sigma^2 = \sigma_1 \sigma_2$, and in the
numerator all product terms, except $\sum a^2$, are equal to zero because
$a$, $e_1$, and $e_2$ are uncorrelated. We thus see that in this simple case the
reliability coefficient is that proportion of the total variability, $\sigma^2$,
which is due to the common ability factor $a$.
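
The result just reached can be checked by simulation. The following is only a sketch under stated assumptions: the standard deviations (3.0 for ability, 2.0 for chance), the normal distributions, the sample size, and all function names are illustrative choices, not anything taken from the article. It draws one ability value per individual, adds independent chance factors for two forms, and compares the observed correlation with $\sigma_a^2/\sigma^2$.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_parallel_forms(sd_a, sd_e, n=20000, seed=1924):
    """Correlate two forms x1 = a + e1 and x2 = a + e2 for n simulated pupils."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, sd_a) for _ in range(n)]       # common ability factor
    x1 = [ai + rng.gauss(0.0, sd_e) for ai in a]       # first form
    x2 = [ai + rng.gauss(0.0, sd_e) for ai in a]       # second, similar form
    return pearson(x1, x2)


# Illustrative (assumed) standard deviations of ability and chance.
sd_a, sd_e = 3.0, 2.0
expected = sd_a ** 2 / (sd_a ** 2 + sd_e ** 2)   # sigma_a^2 / sigma^2
observed = simulate_parallel_forms(sd_a, sd_e)
print(f"theoretical r12 = {expected:.3f}, simulated r12 = {observed:.3f}")
```

With a sample of this size the simulated coefficient should fall close to the theoretical proportion; the two will not agree exactly, since the chance factors are only uncorrelated "in the long run."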

We may now consider a more complex situation, first noting that 
if a score is correlated to some extent greater than zero and less than 
1.00 with a second measure, as in the preceding illustration $x_1$ was
correlated with $a$, it may always be thought of as being composed of
two uncorrelated parts, one part perfectly correlated with the second
measure and the other part having zero correlation with it. Accord- 
ingly the following treatment, which will split a score up into inde- 
pendent parts, is, in fact, sufficiently broad to cover the typical 
situation of correlation between tests greater than zero and less 
than one. 

Let us suppose the score, $x_1$, on the first test, to be a linear function
of $a$, $b_1$, $c$, $d_1$, $e_1$; and the score, $x_2$, on the second test to be a linear
function of $a$, $b_2$, $c$, $d_2$, $e_2$, where:

$a$ = a factor common to test 1, test 2, and to the general field
represented by the subject (e.g. “Computation”).

$b_1$ = a factor common to test 1 and the general field, but not found
in test 2.

$b_2$ = a factor common to test 2 and the general field, but not found
in test 1.

$c$ = a factor common to test 1 and test 2, but not to the general field.

$d_1$ = a factor unique to test 1, but not chance.

$d_2$ = a factor unique to test 2, but not chance.

$e_1$ = a chance factor found in test 1.

$e_2$ = a chance factor found in test 2.

We may now write

$$x_1 = a + b_1 + c + d_1 + e_1$$
$$x_2 = a + b_2 + c + d_2 + e_2$$

The variables $a$, $b_1$, $b_2$, $c$, $d_1$, $d_2$, $e_1$, $e_2$ are uncorrelated. For simplicity
of treatment let us suppose that the standard deviations of the
corresponding factors of the two tests are equal, and accordingly $\sigma_1 = \sigma_2$.
The effect of unequal standard deviations is referred to later.
The correlation between $x_1$ and $x_2$ is

$$r_{12} = \frac{\sum (a + b_1 + c + d_1 + e_1)(a + b_2 + c + d_2 + e_2)}{N \sigma^2} = \frac{\sum a^2}{N \sigma^2} + \frac{\sum c^2}{N \sigma^2} = \frac{\sigma_a^2 + \sigma_c^2}{\sigma^2}$$


What connection is there between this value $r_{12}$ and the reliability
coefficient of $x_1$? If it were possible (which it is not because $d_1$ is
unique) to draw up a test, let us call it test 3, differing from test 1 only
in the chance factor $e$, then

$$x_3 = a + b_1 + c + d_1 + e_3$$

and by simple calculation

$$r_{13} = \frac{\sigma_a^2 + \sigma_{b_1}^2 + \sigma_c^2 + \sigma_{d_1}^2}{\sigma^2}$$

It is this value, $r_{13}$, which I believe Dr. Crum has in mind as the reliability
coefficient. It is the most obvious way in which to define the reliability
coefficient. The definition might be worded: “The reliability
of a test, of standard deviation $\sigma$ and having a chance factor of standard
deviation $\sigma_e$, is equal to $1 - \sigma_e^2/\sigma^2$,” or again: “The reliability
coefficient of a test is the coefficient of correlation between the test and a
second having an equally large chance factor and differing only
from the first in the chance factor.” I do not believe this to be the
most serviceable definition because $d_1$, even though it is not chance,
acts exactly as a chance factor so far as it concerns the use of $x_1$ as a
measure having further implications. Statistically, there is no means
of differentiating between $d$ and $e$, and, practically, I see no reason
why one should wish to. If $r_{13}$ is taken as the definition of the reliability
coefficient I would not only agree with Dr. Crum that there is
“no justification whatever for using Brown’s Formula in the study
of reliability of a test which covers a range of specific topics,” but
would go further and say that if the Spearman-Brown Formula is
avoided and the correlation between two tests calculated, it likewise
would not be a reliability coefficient.

It should be possible to draw up a test (let us call it test 4) differing
only from test 1 in the unique and chance factors; then, $b_1$ is the same
as $a$ in its relationships so we may let $a' = a + b_1$, and write

$$x_1 = a' + c + d_1 + e_1$$
and
$$x_4 = a' + c + d_4 + e_4$$

The correlation $r_{14}$ is then equal to $(\sigma_{a'}^2 + \sigma_c^2)/\sigma^2$. Shall we call this
a reliability coefficient? There is no factor common to one of the tests
and the general field but not to the other test, which is as we would
wish, but there is a $c$ factor common to test 1 and test 4 but not to the
general field, which makes the $r_{14}$ too large as a measure of the
reliability of the test in its bearing upon the general field. It seems to me
that this must be called a genuine reliability coefficient, but you will
note that the test has a reliability in measuring something which is not
simply the general field.

As an illustration let us suppose that the field represented by com- 
putation involves a first function to the extent of 95 per cent and a 
second function, which we will say is speed, to the extent of 5 per cent. 
Let us suppose that each of the two tests used involves the first function
to the extent of 19 per cent, the second function to the extent of 57
per cent, and chance to the extent of 24 per cent. Here the weights 
of the first and second functions in the general field, as 19 is to 1, are 
not paralleled in the tests, where the ratio is 1 to 3. The correlation 
between these two similar tests is high, but the reliability is, in the 
main, based upon speed, a very minor aspect of computation, so that 
as a measure of computation either test is quite unsatisfactory. In 
brief, the tests possess a large factor c not paralleled in the general 
field. It may happen that this function c, not in the general field that 
we are considering, is an important aspect of some other worthwhile 
general field, so that the high reliability coefficient found for the test 
is a genuine indication that the test is in truth a good measure of 
something that is worth measuring, but at other times c may be a 
quite trivial function in which case the test is a good measure of 
something that it is not worth while measuring. Though we are 
compelled to accept $r_{14}$ as a genuine reliability coefficient it is definitely
too high for the purposes for which we wish to use the test because
it contains the factor $c$ not in the general field. The conclusions
from this discussion are that not only should we be careful that our
test measures what we wish to measure, but we should be equally 
careful that it does not measure other things. 
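
The arithmetic of this illustration can be sketched as follows. The sketch assumes, which the text does not state explicitly, that "to the extent of ... per cent" means a share of the total variance and that the functions and chance are uncorrelated; the dictionary keys and function names are hypothetical. Under that reading the correlation between the two similar tests is the sum of the shares they hold in common, and the correlation of either test with the general field is the sum of the square roots of the products of corresponding shares.

```python
import math

# Variance shares, read (as an assumption) from the percentages in the text.
field = {"first": 0.95, "speed": 0.05}                  # the general field
test = {"first": 0.19, "speed": 0.57, "chance": 0.24}   # either of the two tests


def r_between_similar_tests(shares):
    """Correlation of two similar tests: everything but chance is shared."""
    return sum(v for k, v in shares.items() if k != "chance")


def r_test_with_field(test_shares, field_shares):
    """Correlation of a standardized test with the standardized general field."""
    return sum(math.sqrt(test_shares[k] * field_shares[k]) for k in field_shares)


r_tests = r_between_similar_tests(test)    # 0.19 + 0.57
r_field = r_test_with_field(test, field)   # sqrt(.19 * .95) + sqrt(.57 * .05)
speed_part = test["speed"] / r_tests       # share of the test-test correlation due to speed

print(f"r between the two tests:      {r_tests:.2f}")
print(f"part of that due to speed:    {speed_part:.2f}")
print(f"r of a test with the field:   {r_field:.2f}")
```

On these assumptions three-quarters of the inter-test correlation rests on speed, and the correlation of either test with the field is markedly lower than the correlation of the tests with each other.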

For the purposes of informing us of the accuracy with which test 1
measures the general field the reliability that we would wish is
$(\sigma_a^2 + \sigma_{b_1}^2)/\sigma^2$ and the coefficient that we actually obtain when we correlate
$x_1$ with $x_2$ is $(\sigma_a^2 + \sigma_c^2)/\sigma^2$. The real issue in a practical situation is
whether $\sigma_{b_1}^2$ is larger or smaller than $\sigma_c^2$. I can only give my opinion
in this matter as the thorough-going statistical study of the question 
is still to be made. 
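
The several coefficients distinguished above can be set side by side in a short sketch. The function name and the variance figures are illustrative assumptions only; the formulas are those of the text, with the corresponding factors of the two tests given equal variances.

```python
def coefficients(var_a, var_b, var_c, var_d, var_e):
    """Coefficients of the five-factor model, given the factor variances.

    var_b, var_d and var_e are the variances of b1, d1 and e1 (assumed equal
    to those of the corresponding factors in the other test).
    """
    total = var_a + var_b + var_c + var_d + var_e
    return {
        "r12": (var_a + var_c) / total,                  # obtained from two actual forms
        "r13": (var_a + var_b + var_c + var_d) / total,  # differs in chance only
        "r14": (var_a + var_b + var_c) / total,          # differs in unique and chance
        "wished": (var_a + var_b) / total,               # bearing on the general field
    }


# Hypothetical variances: b factor larger than c factor.
few_common = coefficients(var_a=50, var_b=10, var_c=5, var_d=5, var_e=30)
# The same figures with the b and c variances interchanged.
much_common = coefficients(var_a=50, var_b=5, var_c=10, var_d=5, var_e=30)

for name, coeffs in (("b > c", few_common), ("c > b", much_common)):
    print(name, {k: round(v, 2) for k, v in coeffs.items()})
```

The obtained $r_{12}$ falls below the wished-for coefficient when the $b$ variance exceeds the $c$ variance, and above it in the contrary case, which is the issue stated in the preceding paragraph.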

Where there are a small number of questions in a test and where 
a second test is drawn up irrespective of the first, or possibly specifi- 
cally with a view to avoiding the subject matter of the first, it seems 
reasonable to expect a substantial b factor and no especial reason 
to expect a substantial c factor. Under these quite common conditions
I would expect $\sigma_b$ to be greater than $\sigma_c$ and thus the obtained $r_{12}$
to be too small as a genuine reliability coefficient.

The situation is quite radically changed when a test is split in two, 
the halves correlated, and the Spearman-Brown Formula used to 
obtain an $r_{1I}$. There is still reason to expect substantial b factors, but
there is now much reason to expect a substantial c factor as well. 
To mention a few reasons for this: At a single sitting one’s buoyancy 
and vigor in attacking the odd problems is probably much the same as
that in attacking the even problems. A feeling of discouragement 
because of failure on a problem probably carries over to the next problem 
thus increasing the chances of failure on it. One’s general tone, up 
or down, as a result of a good sleep, a poor breakfast, etc., is the same 
on the odd as on the even problems. Probably the practice effect of 
earlier problems upon later ones in itself leads to a correlation between 
scores on problems. This is particularly important with children or 
with adults under strange conditions. Specific information contained 
in or brought out in earlier problems is sometimes used in later problems 
(I believe this is more likely to happen in geometry than in other 
tests). Conditions of light, heat, sympathy with examiner’s methods, 
groans, smiles, sighs or chuckles of one’s neighbors, etc. operate sub- 
stantially the same on odd as on even examples. For these reasons I 
am inclined to think an $r_{1I}$ obtained by the split test method in the
case of College Entrance Examination Board’s examinations in 
geometry and algebra would have involved in it greater c factors than b 
factors and accordingly be a trifle too high. I know of no better simple
way of securing an estimate of the reliability of a college entrance test 
than to split it into halves and use the Spearman-Brown Formula and 
though there are hazards in doing this I certainly think that such an 
estimate is very much better than none at all. 
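
The split-test procedure referred to may be sketched as below: the odd-numbered and even-numbered problems are scored separately, the two half scores correlated, and the result stepped up by the Spearman-Brown Formula $2r/(1 + r)$. The six pupils and six problems are invented for illustration and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Reliability of the whole test estimated from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Return (half-test correlation, stepped-up reliability) for odd/even halves."""
    odd = [sum(row[0::2]) for row in item_scores]    # problems 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # problems 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, spearman_brown(r_half)


# Hypothetical right/wrong scores: one row per pupil, one column per problem.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(scores)
print(f"half-test r = {r_half:.3f}, Spearman-Brown estimate = {r_full:.3f}")
```

A sample of six is of course far too small for any real estimate; it serves only to show the steps. The hazards named in the text (factors common to the two halves at a single sitting) are not removed by the computation.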

I have thus far said little in regard to unequal standard deviations 
of tests, a point that Dr. Crum treats at some length. The reliability 
of a score is not a function of the specific units of measurement used; 
it is rather a function of the relative variability of the chance and unique 
factors as compared to the variability of the other factors contributing 
to the total score. I can make this point clear by citing the scores of a 
given group on both the Thorndike and the Ayres Handwriting Scales. 
Let us call the first $x_1$ and the second $x_2$ scores. The range of scores in
the first instance is from 4 to 18 and in the second from 20 to 100.
No requirement that $\sum x_1^2$ equal $\sum x_2^2$ is met. Nevertheless, these two
scales have been shown to measure almost exactly the same thing, and 
to do it with substantially equal accuracy. In fact, the correlation 
between them is a very good measure of the reliability of either. I cite 
this that the issue may not be confused by a discussion of gross varia- 
bility of scores, or by a comparison of means on two tests. I would not 
say that Dr. Crum has so confused the issue, but it requires a very 
careful reading of his discussion not to be led into this way of thinking. 
The issue of importance is whether that part of the variability in the 
first test due to a certain function is the same proportion of the total
variability as the part of the variability in the second test which is due 
to the same function. In calculating reliability by correlating two 
tests, gross variability is not an issue. 
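
That the correlation between two tests is unaffected by the units of either may be shown with a few lines. The scores below are invented for illustration (they are not Thorndike or Ayres data), and the rescaling constants are arbitrary.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sum_sq_dev(xs):
    """Sum of squared deviations from the mean (the quantity written Σx²)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


# Hypothetical scores of seven pupils on a 4-18 scale and on a 20-100 scale.
scale_one = [4, 7, 9, 11, 12, 15, 18]
scale_two = [25, 30, 55, 50, 70, 80, 95]

r_raw = pearson(scale_one, scale_two)

# Express the first scale in different units: an arbitrary linear change.
rescaled = [5.0 * x + 10.0 for x in scale_one]
r_rescaled = pearson(rescaled, scale_two)

print(f"sums of squares: {sum_sq_dev(scale_one):.1f} vs {sum_sq_dev(scale_two):.1f}")
print(f"r before rescaling = {r_raw:.4f}, after = {r_rescaled:.4f}")
```

The sums of squares differ widely between the two scales, yet the coefficient is the same whatever units the first scale is expressed in.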

In using the Spearman-Brown Formula to estimate reliability of 
a single test, the situation is less simple. (a) If the two halves are in 
truth measures of the same function and equally reliable and equally 
variable, the reliability of the sum or average of the two is exactly 
given by the Spearman-Brown Formula. (b) If they are in truth 
measures of the same function but unequally reliable and unequally 
variable but not radically different in these respects, the reliability 
of the sum will be given by the Spearman-Brown Formula to 
a remarkably close approximation. [See below for proof under
condition (b).]

So far as variability is concerned (the existence of b factors is not 
the issue here) this condition (b) seems to me to cover the common 
situation, thus the $r_{1I}$ found by the Spearman-Brown Formula is a
very good estimate of the reliability of the entire test. For this 
reason I do not think Dr. Crum’s criticism of Dr. Wood for failure to 
provide standard deviations of the half tests is fully justified. As a
general policy it would be desirable to have these, but I do not think 
knowledge of them would ordinarily throw light upon the question 
of whether the Spearman-Brown Formula was applicable or not. 

I here go very much further than does Dr. Crum. He implies that 
if certain conditions which he lays down are met, or if we can so split 
our test as to secure halves that meet these conditions, we have found 
a warrant for the use of the Spearman-Brown Formula. The thing is 
not as simple as this. If every condition stipulated by Dr. Crum is 
met we still may not have a serviceable reliability coefficient for we 
would know practically nothing as to the relative magnitudes of what 
I have called the a, b, and c factors.

I must, in fact, object to the view that we have any right to search 
for such a division of the test into parts as will meet certain mathemati- 
cal conditions. Chance is always a large factor in a score, and if we 
use it to our advantage we can meet very rigorous conditions, giving 
us a spurious sense of certainty. Splitting a test into odds and evens
is certainly unprejudiced by a posteriori demands, nor is there any 
endeavor by such a procedure to capitalize chance. It is a chance 
division so far as the chance factors are concerned, and that is the 
main desideratum. 
Though I believe Dr. Crum has overstated the difficulty of securing
a reliability coefficient in his treatment of variability, I, at the same 
time, think that he has understated it in his laying down of ‘sufficient’ 
conditions. I am in entire agreement with him that there are factors 
which are unique to certain questions and to certain tests, but I do not 
consider our failure to measure them in a reliability coefficient a 
disadvantage. I am, of course, also in agreement with him that there 
commonly are factors (b factors) related to the general field and found in
the first test (or half test) and not found in the second test (or half test) 
and that our failure to measure these in a reliability coefficient is a 
disadvantage, which at times may be serious. 

It seems to me that Dr. Wood’s report, Dr. Crum’s article, and the 
new College Entrance Examination Board’s Report are all steps in 
the right direction; that they will contribute to establishing the con- 
viction that every test measure has an intrinsic degree of reliability; 
that this is the second most important fact to know about the measure 
(the first being what it is a measure of); and that there are difficulties, 
a few of them statistical but most of them experimental, in the way 
of determining this reliability. 


PROOF OF APPLICABILITY OF SPEARMAN-BROWN FORMULA UNDER
CONDITION (b)


The proof of this is as follows: Let $x_1$ and $x_2$ be the halves of the first
test and $x_3$ and $x_4$ the halves of a similar second test. Then the true
reliability of the first test is $r_{(1+2)(3+4)}$ and the reliability given by the
Spearman-Brown Formula is $r_{1I}$, which is equal to $2r_{12}/(1 + r_{12})$. Let

$$x_1 = a + e_1$$
and
$$x_2 = (1 + \Delta)a + (1 + \delta)e_2$$

in which $\sigma_{e_1} = \sigma_{e_2} = \sigma_e$, i.e., any difference in the gross standard
deviations of the chance factors is allowed for by the magnitude $\delta$
just as any difference in the gross standard deviations of the ability
factors is allowed for by the magnitude $\Delta$. Further, we will suppose
that $\Delta$ and $\delta$ are not large so that $\Delta^2$, $\Delta\delta$, and $\delta^2$ are terms negligible in
comparison with $\Delta$ and $\delta$ terms. Then we have

$$\sigma_1^2 = \sigma_a^2 + \sigma_e^2$$
$$\sigma_2^2 = (1 + \Delta)^2 \sigma_a^2 + (1 + \delta)^2 \sigma_e^2$$
$$r_{12} = \frac{(1 + \Delta)\sigma_a^2}{\sigma_1 \sigma_2}$$
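This value may be substituted for $r_{12}$ in the $r_{1I}$ formula, giving after
expansion, reduction and discarding the second and higher power terms
in the deltas,

$$r_{1I} = \frac{2\sigma_a^2}{2\sigma_a^2 + \sigma_e^2} + \frac{2(\Delta - \delta)\sigma_a^2 \sigma_e^2}{(2\sigma_a^2 + \sigma_e^2)^2} = \frac{2\sigma_a^2}{2\sigma_a^2 + \sigma_e^2}\left[1 + (\Delta - \delta)\left(1 - \frac{2\sigma_a^2}{2\sigma_a^2 + \sigma_e^2}\right)\right]$$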


If $r_{(1+2)(3+4)}$ is expanded in terms of the elementary coefficients $r_{12}$,
$r_{13}$, $r_{14}$, $r_{23}$, $r_{24}$, and $r_{34}$ and the standard deviations, $\sigma_1$, $\sigma_2$, $\sigma_3$, and $\sigma_4$,
first noting, since

$$x_3 = a + e_3$$
and
$$x_4 = (1 + \Delta)a + (1 + \delta)e_4$$

that $\sigma_1 = \sigma_3$; $\sigma_2 = \sigma_4$; and that $r_{12} = r_{14} = r_{23} = r_{34}$, we will obtain,
after several reductions, the same value as that found for $r_{1I}$.
Accordingly,

$$r_{1I} = r_{(1+2)(3+4)}$$
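
The agreement asserted here can be examined numerically. The sketch below is an illustration under assumed values ($\sigma_a^2 = \sigma_e^2 = 1$, $\Delta = .10$, $\delta = .05$); the function names are hypothetical. It computes the Spearman-Brown value from the exact $r_{12}$, the exact correlation between the sums $x_1 + x_2$ and $x_3 + x_4$ (obtained directly from the variances and covariances of the model), and the first-order expression.

```python
import math


def spearman_brown_value(var_a, var_e, big_delta, small_delta):
    """r_1I = 2 r12 / (1 + r12), with r12 computed exactly from the model."""
    sd1 = math.sqrt(var_a + var_e)
    sd2 = math.sqrt((1 + big_delta) ** 2 * var_a + (1 + small_delta) ** 2 * var_e)
    r12 = (1 + big_delta) * var_a / (sd1 * sd2)
    return 2 * r12 / (1 + r12)


def true_reliability(var_a, var_e, big_delta, small_delta):
    """Exact correlation between x1 + x2 and x3 + x4 under the model."""
    common = (2 + big_delta) ** 2 * var_a              # covariance of the two sums
    chance = (1 + (1 + small_delta) ** 2) * var_e      # chance variance of either sum
    return common / (common + chance)


def first_order_value(var_a, var_e, big_delta, small_delta):
    """The expression retaining only first powers of the deltas."""
    base = 2 * var_a / (2 * var_a + var_e)
    return base * (1 + (big_delta - small_delta) * (1 - base))


args = (1.0, 1.0, 0.10, 0.05)   # assumed var_a, var_e, Delta, delta
sb = spearman_brown_value(*args)
true_r = true_reliability(*args)
approx = first_order_value(*args)
print(f"Spearman-Brown {sb:.4f}, true {true_r:.4f}, first-order {approx:.4f}")
```

For deltas of this size the three values differ only in the third or fourth decimal place.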


This condition holds when second power terms in the deltas are 
negligible in comparison with first power terms. It does in fact 
approximately hold for a wide range of variability as may be proven 
by comparing the second power terms in the expansions, as below. 
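To simplify the treatment, let us consider the case when $\Delta$ and $\delta$ are
equal. We have

$$x_1 = a + e_1$$
$$x_2 = (1 + \Delta)(a + e_2)$$
$$x_3 = a + e_3$$
$$x_4 = (1 + \Delta)(a + e_4)$$
$$\sigma_{e_1} = \sigma_{e_2} = \sigma_{e_3} = \sigma_{e_4} = \sigma_e$$
$$r_{1I} = \frac{2r_{12}}{1 + r_{12}} = \frac{2\sigma_a^2}{2\sigma_a^2 + \sigma_e^2}$$

We can find after a number of reductions that:

$$r_{(1+2)(3+4)} = \frac{2\sigma_a^2}{2\sigma_a^2 + \sigma_e^2}\left[1 - \left(\frac{\Delta}{2 + \Delta}\right)^2 \frac{\sigma_e^2}{2\sigma_a^2 + \sigma_e^2} + \left(\frac{\Delta}{2 + \Delta}\right)^4 \left(\frac{\sigma_e^2}{2\sigma_a^2 + \sigma_e^2}\right)^2 - \cdots\right]$$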
The bracketed expression is a convergent series for all values of $\Delta$ if $r_{13}$ is greater than zero,
and it is a very rapidly convergent series for all ordinary situations.
Consider the extreme case where the reliability $r_{13} = .5$ and $\Delta = 1.0$.
These, I believe, are more extreme conditions than any for which the
Spearman-Brown Formula has ever been used. We then have

$$r_{1I} = \frac{2(.5)}{1 + .5} = \frac{2}{3}$$

$$r_{(1+2)(3+4)} = \frac{2}{3}\left[1 - \frac{1}{27} + \frac{1}{729} - \frac{1}{19683} + \cdots\right] = .643$$

Thus in this very extreme situation there is only a 3½ per cent error
in the value of $r_{1I}$ when derived from the Spearman-Brown Formula.
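
The extreme case can be verified with exact fractions. The sketch assumes unit variances for ability and chance (which gives a half-test reliability of .5) and $\Delta = \delta = 1$; the function names are hypothetical, and the exact reliability is computed directly from the variances and covariances of the model rather than from the series.

```python
from fractions import Fraction


def spearman_brown(r12):
    """Stepped-up reliability from the correlation between the halves."""
    return 2 * r12 / (1 + r12)


def exact_reliability(var_a, var_e, delta):
    """Exact r_(1+2)(3+4) when the second half is (1 + delta)(a + e)."""
    common = (2 + delta) ** 2 * var_a            # covariance of the two sums
    chance = (1 + (1 + delta) ** 2) * var_e      # chance variance of either sum
    return common / (common + chance)


var_a = var_e = Fraction(1)          # makes the half-test reliability 1/2
delta = Fraction(1)                  # second half twice as variable as the first

r12 = var_a / (var_a + var_e)        # the factor (1 + delta) cancels when Delta = delta
estimate = spearman_brown(r12)
exact = exact_reliability(var_a, var_e, delta)
relative_error = (estimate - exact) / estimate

print(f"Spearman-Brown: {estimate} = {float(estimate):.3f}")
print(f"exact value:    {exact} = {float(exact):.3f}")
print(f"relative error: {relative_error} = {float(relative_error):.1%}")
```

The error is here reckoned as a fraction of the Spearman-Brown value, which gives the figure of about 3½ per cent.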





The paragraphs of Professor Crum’s article referred to on page 
193 are as follows: 


The essential point is that we can not hope to establish by a priori considera- 
tions that such a test, covering a range of capacities, will even approximately 
fulfill the conditions which we impose. There is indeed a remote possibility 
that it may happen to satisfy those conditions, and a slightly greater possibility 
that it may satisfy the fundamental necessary condition; but such possibility 
certainly gives no warrant for assuming that the conditions are met. In other 
words, it does not seem unfair to say that there can be no justification whatever 
for using Brown’s formula in the study of reliability of a test which covers a 
range of specific topics. It is on such grounds that the analysis in the Wood 
Report appears to be unsound; and the conclusions based on such analysis, in 
particular the conclusion in paragraph 2 on page 14, have not been substantiated. 

The case is somewhat different with a test designed to measure a single capac- 
ity, such as a section of a standard mental test. Here it does not seem impossible 
to design a test which shall fulfill the requirements (8). The test really consists 
of a series of similar observations of the same magnitude: The various questions 
are the instruments of measurement. Clearly, in the nature of the case, it is 
desirable to have the individual standard deviations all approximately equal and 
to have a uniformly high correlation of the individual questions with each other. 
Care and practice will doubtless enable the examiner to insure to a high degree 
the realization of this ideal. With such a test, it is not difficult to believe that 
the conditions (8) might be satisfied at least approximately. All this is, however, 
robbed of much of its importance, when we remark that such a test is after all 
but a composite of two similar tests given in immediate succession. If we can 
make the two halves of such a test have high correlation with each other, we 
should be able to make an entire new test which should have high correlation 
with the first. Thus the practical problem of estimating reliability is of relatively 
little moment in the case of a test intended to measure a single capacity. 
A STUDY OF WRITING ABILITY AND ITS RELATION
TO OTHER ABILITIES BASED ON REPEATED
TESTS DURING A PERIOD OF 20 MONTHS¹


ARTHUR I. GATES AND JESSIE LASALLE 
Teachers College 


In another article,² one of the writers reported the development
of a formula which yields a score for general writing ability in
which speed and quality are combined. The formula is


Combined score = Quality × √Speed


Quality is stated in terms of steps on the Thorndike Writing Scale 
and Speed is the number of letters written per minute. In the earlier 
article the validating evidence, the theoretical importance and the 
practical services of the combined score are presented. In the present 
study the combined scores are utilized to express writing ability. 

A two-minute test in writing was given following a preliminary 
“warm-up.” Four tests were given at intervals of 4 months over a 
20-month period beginning October, 1920 and ending in May 1922, 
embracing two full school years. These tests were for: (1) Speed of 
reading by means of the Courtis or Burgess Tests; (2) comprehension 
in reading by the Thorndike-McCall Tests; (3) spelling ability by means 
of lists of 50 words from the Ayres-Buckingham Scale; (4) arithmetical 
ability by the Woody Tests. At an interval of approximately a year, 
the Stanford-Binet and the National Intelligence Tests were also 
given. The correlations to be analyzed were based upon 78 children, 
members of Grades III, IV, V and VI in 1920, who completed all of 
the tests during this period. Pearson’s Product-Moment Formula 
was used throughout. 


THE INFLUENCE OF AN INTERVAL OF TIME, FROM 0 TO 20 MONTHS,
BETWEEN TESTS UPON THE CORRELATIONS 


The first question concerns the influence of an interval of time
between tests. How well will a test of writing given today predict 
abilities 4, 8, 12 or more months hence? Our data provide several 
comparisons of correlations obtained at different intervals. They 
are given in Table I. 





1 The data were secured in the Scarborough School, Scarborough, New York, 
during the period between Sept., 1920 and June, 1922. 
2 Journal of Educational Psychology, March, 1924. 
TABLE I. THE CORRELATIONS BETWEEN WRITING TESTS AT DIFFERENT
INTERVALS

Intervals in months                  4      8      12     16     20

Individual r's                      .88
                                    .91    .84
                                    .94    .92    .86
                                    .85    .88    .90    .84

1. Average r                        .896   .883   .877   .855   .850
2. P.E.                             .010   .011   .012   .015   .022
3. Differences from preceding r            .013   .006   .022   .005   (average .0115)

Table I contains evidence of three facts of importance: (1) Writing 
ability predicts itself over a period of 20 months or less with a marked 
degree of accuracy; (2) the correlations become gradually lower as 
the intervals become wider;¹ and (3) the decreases in the correlations
are probably uniform, the variations in the differences (line (3)) 
being due to chance. 

The magnitude of these correlations will be more significant when 
compared to those obtained with other abilities based on the same 
group of pupils. Since coefficients actually obtained in writing (above) 
and in other subjects depend, however, not only upon the associations 
which exist between the functions compared but also upon the relia- 
bility of the instruments, the real relationships are disclosed only by 
coefficients corrected for attenuation, due to the unreliability of the 
measures. In some cases the “reliability coefficients” upon which
the correction formulas are based could not be computed directly 
since tests were not repeated within a period of less than four months. 
Our procedure in all cases, because of these exceptions, was to estimate 
the reliability coefficient from the tendency of the r’sat4, 8, . . . 
20 months. Thus for writing (see Table I) we added to the average 
correlation at a four-month interval (0.896) the decrease due to a four- 
month interval on the average, 7z.e., .0115. The result is 0.9075 


1 A similar result with other functions was reported by Gates, G. S.: Individual 
Differences as Affected by Practice. Archives of Psychology, Vol. LVIII, 1922. 
(.91), which is accepted as the reliability coefficient. Others were
estimated in a similar manner with results as shown in Table II. 
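
The estimate just described can be retraced in a few lines. The following Python sketch is only an illustration of the arithmetic, not the authors' own working; the function and variable names are assumptions introduced here, and the averages are those of Table I.

```python
# Average writing-with-writing correlations from Table I, keyed by interval in months.
avg_r = {4: 0.896, 8: 0.883, 12: 0.877, 16: 0.855, 20: 0.850}


def estimate_zero_interval_r(avg_by_interval):
    """Extrapolate the self-correlation one four-month step back to a zero interval.

    Returns the estimated reliability and the mean drop per four-month step.
    """
    intervals = sorted(avg_by_interval)
    values = [avg_by_interval[m] for m in intervals]
    # Decrease from each interval to the next (line 3 of Table I).
    drops = [earlier - later for earlier, later in zip(values, values[1:])]
    mean_drop = sum(drops) / len(drops)
    # Add one mean drop to the shortest-interval average.
    return values[0] + mean_drop, mean_drop


reliability, mean_drop = estimate_zero_interval_r(avg_r)
print(f"mean drop per four months: {mean_drop:.4f}")
print(f"estimated reliability:     {reliability:.4f}")
```

With the Table I averages the sketch gives a mean drop of .0115 and an estimate of .9075, which rounds to the .91 accepted in the text.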


TABLE II.—RELIABILITY COEFFICIENTS OR SELF-CORRELATION WHEN INTERVAL
OF TIME IS NEGLIGIBLE


ES EES ne a RE” SE ee a ee 0.91 
ETERS OEE De CE DS LAE Sen SM 0.80 
CU cc icccccscevesessatereceeeee 0.83 
REA das SAG bith USE CENEHS Fb ave hss chbu eaSben dee webs 0.90 
LL 6 de SOk «5 ii'e Abad o he NENW as Rs eens bmn 0.93 


By means of a simple formula,¹ from the average correlations
between writing ability in Table I the coefficients may be derived 
that would be found were the test scores perfectly reliable. The 
correlations thus corrected for attenuation are given in Table III. 
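
As a rough check on this step, the correction for attenuation may be sketched in Python. The function name is an assumption made for illustration; the reliability of .91 is the writing coefficient derived above, used for both tests because writing is here correlated with itself.

```python
import math


def correct_for_attenuation(r_observed, reliability_1, reliability_2):
    """Spearman's correction: observed r divided by the geometric mean of the reliabilities."""
    return r_observed / math.sqrt(reliability_1 * reliability_2)


WRITING_RELIABILITY = 0.91  # reliability coefficient accepted for the writing test

# Average writing-with-writing correlations from Table I, keyed by interval in months.
avg_r = {4: 0.896, 8: 0.883, 12: 0.877, 16: 0.855, 20: 0.850}

corrected = {
    months: correct_for_attenuation(r, WRITING_RELIABILITY, WRITING_RELIABILITY)
    for months, r in avg_r.items()
}

for months, r in corrected.items():
    print(f"{months:>2} months: {r:.3f}")
```

The values obtained this way agree with the corrected coefficients of Table III to within about .002; the small differences presumably reflect rounding in the original hand computation.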


TABLE III.—CORRELATION BETWEEN WRITING ABILITY (r’s OF TABLE I CORRECTED
FOR ATTENUATION) AT INTERVALS GIVEN

Intervals in months         0      4      8      12     16     20

Corrected coefficient       1.00   .983   .970   .963   .939   .934

The correlations between tests which yield the pupil’s true writing 
ability are very high even after an interval of two school years, 
although the tendency of the r’s to decrease with increases in the 
intervals is appreciable. The full significance of these correlations 
will be clarified by comparisons with other correlations and by certain 
further corrections, notably by corrections for the wide range in age 
among the pupils measured. 


CORRELATIONS OF WRITING WITH OTHER SCHOLASTIC ABILITIES


In Table IV are given the individual correlations between writing 
and reading rate, reading comprehension, arithmetic, and spelling 


1 The formula is
$$r_{\text{corrected}} = \frac{r_{12}}{\sqrt{(\text{reliability coefficient of 1})(\text{reliability coefficient of 2})}},$$
in which $r_{\text{corrected}}$ means the corrected correlation between variables 1 and 2. See
Spearman, C.: Demonstration of Formulas for True Measurement of Correlation.
American Journal of Psychology.

grouped according to the intervals between the tests. At the foot
of each column is given the average correlation, followed in the next 
line by the average coefficient corrected for unreliability of the tests. 
At the end of the table a summary of the average coefficients corrected 
for attenuation is given together with the corrected correlations of 
writing with itself. 

All of the average correlations with writing decrease as the interval 
between tests becomes longer with a rate that, on the whole, is uniform. 
No scholastic test predicts writing ability as well as a writing test 
predicts itself. The differences are large and significant. 

All of the correlations are relatively high because the range of ages, 
and consequently of abilities is wide, the group comprising four grades. 
Better notions of the significance of the correlations may be obtained 
by rendering age constant by means of partial correlations. To 
secure the most reliable results and, incidentally, to save a great deal 
of labor, the averages of all correlations between each pair of tests, 
and each test with age, have been computed and the partial correlations 
based upon them. The average inter-correlations are given in Table 
V. The resulting partial correlations are given in Table VI.


TABLE IV.—SHOWING CORRELATIONS OF WRITING WITH OTHER ABILITIES BEFORE
AND AFTER CORRECTION FOR UNRELIABILITY OF TESTS

Intervals between tests in months

                  0      4      8      12     16     20

Writing with Arithmetic
                  .74    .74
                  .81    .74
                  .72    .65    .67
                  .74    .72    .70
                  .67    .72    .76    .67
                  .73    .70    .63    .65
                         .78    .70    .69    .67
                         .69    .73    .68    .66
                         .71    .68    .72    .65    .59
                         .69    .73    .69    .65    .67
Mean r            .735   .714   .700   .683   .658   .630
Corrected         .800   .776   .760   .742   .715   .685

Writing with Spelling
                         .77
                         .68
                         .71    .74
                         .74    .67
                  .74    .71    .66    .60
                  .71    .69    .71    .65
                  .74    .68    .65    .70    .72
                  .72    .71    .71    .69    .67
                  .69    .71    .72    .70    .69    .70
                  .72    .70    .69    .70    .63    .58
Mean r            .720   .710   .694   .673   .677   .64
Corrected         .791   .780   .762   .740   .744   .703

Writing with Reading Rate
                         .70
                         .72
                         .70    .66
                         .62    .71
                  .73    .59    .64    .65
                  .70    .73    .56    .62
                  .68    .66    .72    .52    .59
                  .67    .67    .69    .73    .49
                  .63    .64    .65    .69    .69    .56
                  .59    .65    .67    .67    .66    .62
Mean r            .666   .668   .663   .645   .610   .59
Corrected         .783   .786   .780   .758   .717   .694

Writing with Thorndike-McCall Reading
                         .64
                         .68
                         .62    .63
                         .53    .58
                  .65    .50    .57    .53
                  .70    .67    .53    .58
                  .64    .66    .67    .55    .50
                  .65    .63    .65    .64    .58
                  .56    .55    .61    .61    .62    .53
                  .54    .59    .59    .65    .58    .60
Mean r            .623   .607   .604   .593   .570   .565
Corrected         .716   .700   .694   .681   .655   .650

Summary and Comparison with Self-correlation: r’s Corrected for Unreliability
of Measures

                                       0      4      8      12     16     20

Writing with writing                   1.00   .983   .970   .963   .939   .934
Arithmetic with writing                .800   .776   .760   .742   .715   .685
Spelling with writing                  .791   .780   .762   .740   .744   .703
Reading rate with writing              .783   .786   .780   .758   .717   .694
Reading comprehension with writing     .716   .700   .694   .681   .655   .650

TABLE V.—THE MEAN CORRELATIONS OF THE FUNCTIONS WITH AGE AND WITH
EACH OTHER; ALL CORRECTED FOR UNRELIABILITY OF THE MEASURES


Average of 6 r’s, writing with age¹ = 0.82
Average of 6 r’s, arithmetic with age = 0.72
Average of 6 r’s, spelling with age = 0.68
Average of 6 r’s, reading rate with age = 0.67
Average of 6 r’s, reading comprehension with age = 0.67
Average of 15 r’s, writing with writing = 0.97
Average of 36 r’s, writing with spelling = 0.76
Average of 36 r’s, writing with arithmetic = 0.76
Average of 36 r’s, writing with reading rate = 0.77
Average of 36 r’s, writing with reading comprehension = 0.70


TABLE VI.—THE CORRELATIONS BETWEEN WRITING AND OTHER VARIABLES WHEN
AGE IS ELIMINATED BY PARTIAL CORRELATION


Writing with writing = 0.91
Writing with arithmetic = 0.43
Writing with spelling = 0.47
Writing with reading rate = 0.51
Writing with reading comprehension = 0.35


With age rendered constant, or eliminated as a factor, the correla- 
tion of writing with itself is very high (0.91). This figure represents 
roughly the correlation between two perfect tests at intervals of 
about eight months of children of identical ages. The r will be higher 
at four months and lower at 20. It indicates a substantial tendency 
for individuals to maintain their relative positions in this ability. 
How this fact should be explained will be considered in later sections. 
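
The elimination of age may be illustrated with the usual first-order partial correlation formula. The Python sketch below is a hedged reconstruction, assuming that the averages of Table V were entered directly into that formula; the function name and the dictionary layout are introduced here for illustration only.

```python
import math


def partial_correlation(r12, r13, r23):
    """Correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


R_WRITING_AGE = 0.82  # writing with age, from Table V

# (correlation with writing, correlation with age), both taken from Table V.
table_v = {
    "writing": (0.97, 0.82),
    "arithmetic": (0.76, 0.72),
    "spelling": (0.76, 0.68),
    "reading rate": (0.77, 0.67),
    "reading comprehension": (0.70, 0.67),
}

partials = {
    name: partial_correlation(r_with_writing, R_WRITING_AGE, r_with_age)
    for name, (r_with_writing, r_with_age) in table_v.items()
}

for name, r in partials.items():
    print(f"writing with {name}: {r:.3f}")
```

Computed in this way the partial correlations fall within about .01 of the figures in Table VI; exact agreement is not to be expected, since the table entries are themselves rounded to two places.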

The partial correlations of reading rate, spelling, arithmetic, and 
reading comprehension with writing are positive but small. The 
correlations indicate that these several tests embrace factors common 
to writing or at least, that the pairs of tests correlated are influenced 
jointly and similarly by some other factor or factors. 


1 In computing the corrections of the r’s between tests and age, the reliability 
coefficient of age is taken as 1.00, i.e., age is considered to be measured perfectly.


CORRELATIONS OF WRITING WITH INTELLIGENCE


Conceivably, general intelligence is the factor which produces 
these positive associations. This hypothesis may be readily tested 
since two Stanford-Binet and two National Intelligence Tests were 
given during the two years. Each of the six writing tests was corre- 
lated with each of the two measures by both tests of intelligence, and 
all of the tests for writing and intelligence correlated with age. The 
average of each series of coefficients which give the most reliable 
measures, was then computed, corrected for unreliability of the tests 
(the reliability coefficients of MA and N.I.T. being each 0.93) and age 
eliminated, as before. The corrected average r’s together with the 
partial r’s are given in Table VII. 


TABLE VII.—THE AVERAGE CORRELATIONS CORRECTED FOR UNRELIABILITY
OF THE MEASURES AND THE PARTIAL CORRELATIONS, AGE ELIMINATED


Mean of 12 r’s, writing with Binet MA = 0.55
Mean of 6 r’s, writing with age = 0.82
Mean of 6 r’s, Binet MA with age = 0.73
Mean of 12 r’s, writing with N.I.T. = 0.80
Mean of 6 r’s, N.I.T. with age = 0.75


Partial r, writing with Binet MA (age constant) = 0.01 
Partial r, writing with N.I.T. (age constant) = 0.48 


The data of Table VII do not yield an unequivocal interpretation 
of the association of intelligence and writing ability, the result for the 
Stanford Binet differing from that for the N.I.T. These two tests, 
as is generally known, do not measure identical abilities although they 
are highly correlated. Writing ability is entirely unrelated to Stanford 
MA, when age is eliminated but is correlated with scores on N.I.T. 
to the extent r = 0.48. 

Precisely how these results are to be explained may not be fully 
disclosed by our data but certain hypotheses may nevertheless be
partly verified or refuted. 

The correlation of Writing and Binet Scores is not unusual; approxi- 
mately the same results have been found before. Burt, however, 
found a positive correlation between writing and mental age among 
children of inferior intelligence (r = 0.32)¹ and a lower correlation
among unselected children (r = 0.21)² based on children in grade

1 Burt, Cyril: “The Distribution and Relations of Educational Abilities.”
London: P. S. King, 1917, p. 64.
2 Ibid. “Mental and Scholastic Tests.” London: P. S. King, 1921, pp. 183-184.

groups. Probably both of these figures would be reduced somewhat,
by the elimination of age. The groups used in the present study 
were neither dull nor unselected; they were selected, bright children. 
Few were below 100 IQ; the average was 116 IQ. It is not improbable
that the correlation of intelligence as measured by the Binet and 
Writing ability is slight and positive in the lower end of the distribu- 
tion of mental ability, zero in the upper half and, as a consequence, 
slightly positive for the whole range. This hypothesis would, at any 
rate, form a simple explanation of the discrepancies between the results 
of Burt and those of the present study. 

For the correlation of 0.48 between writing and N.I.T. scores there 
are at least two types of conceivable explanations: (1) writing ability,
particularly speed, may be a factor in the success of taking the N.I.T.,
i.e., to the extent indicated by an r of 0.48 the latter may be actually 
a test of writing; (2) both tests may reflect to a degree a common 
general factor, not primarily intellectual but educational. 

The first possibility is always worthy of consideration in appraising 
any test. Courtis and Thorndike¹ have shown the pronounced
influence of writing facility in taking certain tests; indeed, they found 
in one instance that a test, supposedly measuring ability to add, was 
in reality a measure of speed of writing. It should be observed that 
the correlation of writing with N.I.T., age eliminated, is similar in 
magnitude to the correlations of writing with the educational tests, 
viz., writing with N.I.T. 0.48; with reading rate 0.51; with spelling 
0.47; with arithmetic 0.43; with reading comprehension 0.35. In 
all of these tests the responses are written. But that writing is solely 
responsible for the correlations is very doubtful inasmuch as in tests 
of reading comprehension, arithmetic and spelling which for our 
pupils provided a surplus of time for writing, the correlations are nearly 
as high as in the others in which time was limited. Of all the tests, 
however, these three—comprehension, arithmetic and spelling—show 
the smallest correlations with writing. If we assume (since writing 
is almost certainly not a factor contributing to success in these tests) 
that they represent factors other than mere quality and speed of writ- 
ing which may be present in all of the tests, we may, by subtraction, 
estimate the degree to which writing is an influence in the other tests. 
Thus, the average coefficient of spelling, arithmetic, and comprehension 
with writing is 0.416; whereas the correlation of N.I.T. with writing 


1 Courtis, S. A. and Thorndike, E. L.: Correction Formulas for Addition Tests. 
Teachers College Record, January, 1920. 
is 0.48 and rate with writing is 0.51. The differences¹ between .48
and .416, and between .51 and .416, may be taken as very rough measures of
the influence of writing upon achievement in the N.I.T., and the Read- 
ing Rate Test, respectively. 
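
The footnote to this paragraph weighs these coefficients by the coefficient of alienation rather than by simple subtraction. A brief Python sketch of that reckoning follows; the function names are assumptions introduced for illustration, and only the three coefficients discussed in the text are evaluated.

```python
import math


def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1 - r * r)


def error_reduction(r):
    """Fraction of the way from mere chance to perfect estimation, 1 - k."""
    return 1 - alienation(r)


for r in (0.416, 0.48, 0.51):
    print(f"r = {r:.3f}: k = {alienation(r):.3f}, reduction = {error_reduction(r):.2f}")
```

For r = .416, .48 and .51 the reductions come out at about .09, .12 and .14, as stated in the footnote.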

Mere efficiency in writing, then, exerts a slight influence on the 
correlations between writing and N.I.T., and writing and reading 
rate as tested.² What are the causes of the remaining correlations
(averaging 0.416 between writing and each of the several tests) which 
it is not reasonable to suppose could have been caused by writing, 
per se? 

Probably this amount of association is due to certain educational 
factors, of which there are three types. (1) Differential advantages 
in general educational opportunity and instruction would to some 
extent produce correlation among the tests. If some students were 
given more attention, were favored by teachers more than others, 
or if some enjoyed more and better coaching at home or elsewhere in 
several or all of these subjects, the result would be a tendency toward 
positive correlation of the tests. (2) Other things (amount of instruc-
tion, general intelligence, etc.) being equal, those by temperament,
training, or accident more interested in school work, more diligent, 
more eager to achieve or excel, would tend to do better in all of the 
tests than those less favorably disposed in attitude and endeavor, not 
only because of their more vigorous attack on the tests themselves, 
but also because of the cumulative effect of such attitudes during their 
whole scholastic careers. Such differences would tend to produce 
positive correlations among the tests. (3) Differences in experience 


1 It is improper to subtract these correlations (thus r .51 − r .42 = .09), since
the significance of a difference between two coefficients depends upon the size of
the r’s from which the difference was obtained, e.g., the difference between r .80
and r .70 is much greater than a difference between .40 and .30, in terms of the
amount that the error of estimate would be reduced. This significance of an
r is expressed by the coefficient of alienation k: $k = \sqrt{1 - r^2}$. When r = .416,
the error of estimate is reduced .09 of the way between mere chance and perfect
estimation; an r of .48 reduces it .12 and an r of .51 reduces it .14 of this distance.
Roughly, the difference between r .48 and .416 is equal to the difference between r
of .00 and .32; the difference between an r of .51 and .416 is equal to the difference
between r’s of .00 and .30.

2 Elsewhere one of the writers has presented evidence that writing, or writing
and drawing, is an appreciable factor in the Burgess Reading Test. See A Study
of Reading and Reading Tests. Journal of Educational Psychology, Sept., Oct.,
Nov., 1921.

in taking tests, and differences in capacity to improve in the technique
of taking tests, would tend to produce a correlation. It is well known 
that in most educational and intelligence group tests, practice results
in improvement and that a technique of taking tests is acquired more 
readily by some than by others. Such was demonstrably the case, 
as elsewhere indicated,¹ with the subjects of the present investigation.
That a certain amount of this technique acquired in one test transfers
to others and thus tends to increase the correlation among all has
also been elsewhere demonstrated.²

These three types of educational factors cannot be readily distin- 
guished or appraised in the present results. Each probably has 
some effect—we are inclined to believe that the third is the most potent 
—and altogether they probably account for all the correlation between 
writing and the other tests except those amounts earlier attributed to 
the common influence of mere efficiency in writing. The reasons for 
this assumption are as follows: There is no correlation when age is 
eliminated between writing and the Stanford MA. In the latter 
no writing is involved, and the general technique of taking written 
group tests would probably have little or no effect upon it. Further- 
more differences in educational achievement, however occasioned, have 
no appreciable influence in the present instance.³ The correlations are
lowest between writing and other tests upon which educational advan-
tages and attitudes and speed and efficiency in taking tests probably
have the least effect. The lowest correlation, 0.35, is between writing
and Thorndike-McCall, a test for which there was, among the present
subjects, a surplus of time and a test which measures abilities not
readily trained specifically. The arithmetic and spelling tests, which
form a middle group, were scarcely time tests but both abilities are
more susceptible to specific acquisition. N.I.T. and rate of
reading are both time tests, and each, especially the latter which
gave the highest r with writing, may be deliberately and markedly 
influenced by practice either on the tests themselves or on similar 
material. 


1 Gates, A. I.: The Unreliability of MA and IQ Based on Group Tests of
General Mental Ability. Journal of Applied Psychology, March, 1923.

2 Gates, A. I. and Van Alstyne, Dorothy: General and Specific Effects of
Reading. Teachers College Record, March, 1924.

3 The evidence for the last statement will be found in The Relative Predictive
Values of Certain Intelligence and Educational Tests. Journal of Educational
Psychology, Dec., 1923. This statement is not meant to be general; it applies
only to the subjects and corrections found in the present study.

The conclusion to which these considerations point is that, making
allowances for differences in age, general ability in taking group tests, 
general educational advantages, zeal to achieve, and also, in certain
tests (reading rate and N.I.T.), of speed of writing per se, writing ability
is correlated slightly, if at all, with intelligence or with ability in reading,
spelling, and arithmetic among children of average or better than 
average intelligence. With a range of intelligence reaching lower 
levels a small positive correlation would probably be found; with an 
unselected group, a very slight positive correlation. 

The mean of the correlations of the various writing tests with 
each other which are comparable with the other correlations just 
observed, having the same average interval of time between them 
(about eight months), based on the same pupils, with results corrected 
for unreliability and age eliminated, was 0.91. This figure is much
higher than the correlations of writing with other tests, correlations
which have been ascribed to the influence of general ability to take tests,
educational advantages, and educational zeal. The amounts of these
correlations were:


Average of 36 r’s, writing with arithmetic = 0.43
Average of 36 r’s, writing with spelling = 0.47
Average of 36 r’s, writing with comprehension = 0.35
Average = 0.416
Average of 15 r’s, writing with writing = 0.91


Part of the correlation, 0.91, between tests in writing is due to these 
same educational factors, since they will increase correlations between 
tests of writing as well as between writing and other tests. To ascer- 
tain, roughly, how well writing correlates with itself when these educa- 
tional influences have been removed, we may take the average r of 
0.416 (above) as the approximate measure of the educational influences, 
and by application of the formula for partial correlation, compute the 
desired residual. It is: 
Correlation of writing with writing (educational influences eliminated) 
= 0.891 
The residual is still high. This coefficient represents the degree of 
association between writing ability with an interval of approximately 
eight months between tests. Since the influences of intelligence, 
age, general efficiency in taking tests, and the general influence of 
education and of zeal in learning have all been eliminated, the coefficient 
probably represents the influence of a special native aptitude or capac-
ity for achievement in this type of motor activity. Among children
of good intelligence, reared in good homes and taught in a good school, 
innate capacity exerts an influence upon achievement in writing in 
comparison with which all other factors combined appear dwarfed 
and feeble. 
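
The residual of 0.891 can be retraced on the assumption that the educational factor was treated as a third variable correlating 0.416 with each of the two writing tests. Under that assumption the partial correlation formula reduces to a simpler form, sketched below in Python; the function name is hypothetical and the reading of the authors' procedure is an inference from the figures they report.

```python
def partial_r_equal_third(r12, r_common):
    """Partial r of two tests when both correlate r_common with the variable removed.

    With r13 = r23 = c the usual formula reduces to (r12 - c^2) / (1 - c^2).
    """
    c_squared = r_common ** 2
    return (r12 - c_squared) / (1 - c_squared)


# Partial correlations of writing with arithmetic, spelling and comprehension.
educational_rs = [0.43, 0.47, 0.35]
mean_educational = sum(educational_rs) / len(educational_rs)

# The text uses 0.416 as the measure of the educational influences.
residual = partial_r_equal_third(0.91, 0.416)

print(f"mean of the three coefficients: {mean_educational:.4f}")
print(f"writing with writing, educational influences removed: {residual:.3f}")
```

The mean of the three coefficients is .4167, which the text gives as .416, and the residual comes to .891 as reported.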


SUMMARY 


1. Writing ability, under the conditions existing in the school 
studied, is mainly determined by a special native capacity or aptitude. 
Over a period of 20 months, a test in writing predicts future ability 
very closely, even when allowances are made for such factors as general 
intelligence, age, general training in taking tests, general educational 
advantages and general educational zeal, which might produce spurious 
correlations. With all of these factors eliminated the correlation be- 
tween writing ability now and eight months hence will be about 0.89. 

2. Writing ability, like all other scholastic abilities, parallels age 
closely when other influences are equal. It shows very low positive 
correlations with intelligence when age and other factors are eliminated 
within unselected groups. For groups of average and superior children 
the association with intelligence is approximately zero. The same is 
true of the association of writing with ability in arithmetic, reading, 
and spelling, when the sources of spurious correlations are eliminated. 

3. In raw results, writing exerts an influence on the scores of rate 
tests, such as the N.I.T. and Burgess Reading Test. 

4. Spurious correlations between pencil-and-paper tests may be 
produced by (1) practice in taking such tests in which a general 
technique is acquired, (2) differences in general educational opportuni- 
ties, incentives or pressure at home or at school and (3) differences in 
educational attitude, zeal, determination to excel and the like. 
RELIABILITY OF THE STANFORD AND THE
HERRING REVISION OF THE 
BINET-SIMON TESTS 


JOHN P. HERRING 


Director, Bureau of Research, New Jersey Department of Institutions and 
Agencies 


The tables and graphs presented herewith concern primarily the 
reliability of two revisions of the Binet-Simon Tests, the Stanford and 
the Herring. They show that both the Stanford and the Herring are 
very reliable as compared with other instruments for the measurement 
of intelligence.

Since, however, the Herring was made definitely as an alternative 
form of the Stanford, and since, owing to high correlation between 
the two, it may according to accepted convention be considered as such, 
the same correlations are interpreted also as validity coefficients. The 
tables and graphs therefore concern validity as well as reliability. 
Since the Stanford Revision was assumed as the criterion in the con- 
struction of the Herring Revision, whatever strengths and weaknesses 
from the point of view of validity exist in the Stanford have presumably 
to a large extent been included in the Herring. That there is room for 
some difference in the meaning of the mental ages obtained between 
these two tests is evident, in view of the coefficient of alienation,
$k = \sqrt{1 - r^2}$, which is .16, despite the correlations of .98 and .99
between the two. This means that the errors made in estimating one 
test from the other have been reduced only to about .16 of their magni- 
tude as they would be obtained by random guessing. 
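
To make the meaning of k concrete, the following Python sketch computes it for r = .9870 and applies it to a standard deviation of 25.11 months, the Group III figures reported in the table of validity-reliability coefficients. The function names and the constant name are assumptions introduced here; the factor .6745 is the usual ratio of the probable error to the standard error.

```python
import math

PE_FACTOR = 0.6745  # probable error as a fraction of the standard error


def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1 - r * r)


def standard_error_of_estimate(sd, r):
    """Standard deviation of the predicted test multiplied by k."""
    return sd * alienation(r)


r = 0.9870           # Herring with Stanford, Group III
sd_stanford = 25.11  # SD of Stanford mental ages in months, Group III

k = alienation(r)
se = standard_error_of_estimate(sd_stanford, r)
pe = PE_FACTOR * se

print(f"k = {k:.3f}")
print(f"standard error of estimate = {se:.2f} months")
print(f"probable error of estimate = {pe:.2f} months")
```

The sketch gives k of about .16, a standard error of estimate of about 4.0 months and a probable error of about 2.7 months, in line with the tabled values for this group.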

The following tables exhibit various quantitative statements of the 
reliability of the Stanford Revision and of the Herring Revision: 


VALIDITY-RELIABILITY COEFFICIENTS

Group       Measure   n     r       σS      σH      σS.H    PE S.H   σH.S    PE H.S   Average difference

1           MA        72    .9781   32.32   31.99   6.73    4.54     6.66    4.49     5.38
2           IQ        82    .9908   26.01   26.33   3.52    2.37     3.56    2.40     2.84
2           MA        82    .9877   33.00   32.27   5.16    3.48     5.04    3.40     4.13
3           MA        116   .9870   25.11   25.22   4.03    2.72     4.05    2.73     3.24
All cases   MA        270   .9845   35.32   36.04   6.20    4.19     6.32    4.26     5.04

Group       Measure   n     Nr/[1 + (N − 1)r]   σ2S.2H   PE 2S.2H   σ2H.2S   PE 2H.2S

1           MA        72    .9889               4.81     3.25       4.76     3.21
2           IQ        82    .9954               2.50     1.68       2.53     1.71
2           MA        82    .9938               3.68     2.48       3.60     2.43
3           MA        116   .9935               2.86     1.93       2.88     1.94
All cases   MA        270   .9922               4.40     2.97       4.49     3.03

MA    Mental age in months.
IQ    Intelligence Quotient in hundredths.
r     Pearson Product-moment coefficients of correlation of the Herring with the
      Stanford Revisions of the Binet-Simon Tests.
σS    Standard Deviation of mental ages in months of the Stanford Revision of the
      Binet-Simon Tests.
σH    Standard Deviation of mental ages in months of the Herring Revision of the
      Binet-Simon Tests.
σS.H  $\sigma_{S.H} = \sigma_S\sqrt{1 - r^2}$. Standard Error of Estimate in mental ages in months or
      IQ’s in hundredths in predicting the Stanford from the Herring.
PE    $PE = .67449\,\sigma$.
σH.S  $\sigma_{H.S} = \sigma_H\sqrt{1 - r^2}$. Standard Error of Estimate in mental ages in months or
      IQ’s in hundredths in predicting the Herring from the Stanford.
Nr/[1 + (N − 1)r]
      $\frac{Nr}{1 + (N - 1)r}$. Spearman formula for prophesying reliability of N times as many
      tests. (N equals 2.) The magnitudes in this column do not differ from those
      for $r_{1\infty} = \sqrt{r_{1I}}$. When r = .99 and N is 2, Spearman’s formula gives
      .9949749, while $\sqrt{r_{1I}}$ = .9949874; when r = .95 and N is 2, Spearman’s
      formula gives .9743590, while $\sqrt{r_{1I}}$ = .9746794.
σ2S.2H
      $\sigma_{2S.2H}\sqrt{1 - r^2}$. Standard Error of Estimate in mental ages in months in
      predicting the result of two Stanford Revisions from the use of two
      Herring Revisions.
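
The Spearman formula in the legend, and its comparison with the square root of r, may be checked with a short Python sketch. The function name is an assumption introduced for illustration; the coefficients in the loop are the r’s of the first table, and N is taken as 2 throughout, as in the legend.

```python
import math


def spearman_brown(r, n=2):
    """Prophesied reliability of a test lengthened n times: n*r / (1 + (n - 1)*r)."""
    return n * r / (1 + (n - 1) * r)


# Correlations of the Herring with the Stanford, taken from the first table.
observed = {
    "1 MA": 0.9781,
    "2 IQ": 0.9908,
    "2 MA": 0.9877,
    "3 MA": 0.9870,
    "All cases MA": 0.9845,
}

for group, r in observed.items():
    print(f"{group:<13} r = {r:.4f}  doubled = {spearman_brown(r):.4f}")

# Comparison of the prophecy formula with the square root of r.
for r in (0.99, 0.95):
    print(f"r = {r}: Spearman = {spearman_brown(r):.7f}, sqrt(r) = {math.sqrt(r):.7f}")
```

The doubled coefficients reproduce the first column of the second table, and for r = .99 and r = .95 the two expressions differ only in the fourth or fifth decimal place.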





The groups represented in the tables and graphs are here described: 

Group III, described first because it is the most interesting, com- 
prises all the 12-year-old children, except two or three, 116 in number, 
in the public schools of Bloomsburg, Pennsylvania. All of these 
subjects were more than 12 years old and less than 13 on May 1, 1922 
and all were examined by Mrs. Marjorie H. Wilner. The class interval 
used in computing all of the constants was one month of mental age. 
For Stanford mental ages,


$\beta_1 = 0.1557 \pm 0.1303$
and $\beta_2 = 3.2136 \pm 0.5122$.


Since these constants do not differ significantly from the Gaussian
$\beta_1 = 0$ and $\beta_2 = 3$, they do not indicate the presence either of skewing
or of kurtosis except in small amounts due to chance. The SD of 
mental ages, 25 months, is in my experience not unusual for 12-year- 
old age groups. The group, therefore, probably differs only slightly 
and by chance from an unselected age group. The data of this group 
were not used in determining mental age equivalents. 
Group I comprises the first 72 cases which were examined while
the tests were still under modification. It exhibits the lowest correla- 
tions and the greatest errors of estimate. In computing the correla- 
tions for these subjects the data were grouped so that the class intervals 
each included 5 points in IQ and five months in mental age. 

Group II was examined after the tests had ceased to be much modi-
fied. It exhibits the highest correlations and the lowest errors of
estimate. With these subjects each class interval was either one point
IQ or one month mental age, the data being to this degree ungrouped.

CORRELATION SCATTER DIAGRAM OF STANFORD AND HERRING IQ’S IN HUNDREDTHS

Groups I and II taken together, 154 cases, were examined in part 
by the author of the Herring Revision and in part by other examiners. 
These 154 cases covered a very wide range of mental ages, of chronolog- 
ical ages, and of IQ’s, so that the constants of this group are not 
directly comparable with those of an unselected age group. This is 
apparent from the standard deviations of the mental ages and IQ’s 
shown in the table. These 154 cases were used in determining mental 
age equivalents. 
The group labeled All Cases, 270 in number, includes Groups I, II,
and III. The correlations resulting are not averages but were computed
from the data all treated in one group. 

The original data for Groups I and II are found in Herring (1924). 
The original data for Group III exist in Wilner (1923). 

CORRELATION SCATTER DIAGRAM OF STANFORD AND HERRING MENTAL AGES IN MONTHS
(GROUP NUMBER 3)

Two of the groups are represented by means of photographed
scatter diagrams. Group II was selected because it exhibits correlations
between intelligence quotients in hundredths, and because it
exhibits the highest coefficient of correlation and the lowest errors of 
estimate. This group is subject however to the limitation that the 
data were used in standardization. The correlation may therefore 
be spuriously high. Group III was selected because, in the history of 
Binet testing, it is a comparatively large unselected age group, and 
because of other considerations indicated below. The class interval is 
one month of mental age. 

A number of statements may be made in further and summary inter- 
pretation of the data numerically and graphically presented: 
1. Both Binet Revisions are very reliable instruments compared
with other instruments for the measurement of intelligence. 

2. The Pearson Product-Moment Coefficient of Correlation which 
best represents the degree of mutual implication involved in the two 
revisions is probably .9870 between two mental ages obtained for each 
of 116 12-year-olds. 

This group is selected as representative because 

(a) The error of grouping is minimal; 

(b) The group was entirely of 12-year-olds; 

(c) All the 12-year-olds of the Bloomsburg public schools were 
included with the exception of two or three who were absent during 
the whole testing period; 

(d) None of the tests were given by the author, and all were given 
during a period when the examination itself remained unmodified; 

(e) The correlation and probable error of estimate are intermediate 
in magnitude among those presented; 

(f) β₁, β₂ and σ indicate a Gaussian distribution of mental ages
and the absence of selection among 12-year-olds;

(g) The data were not used in the determination of mental ages.

Even so high a correlation as .9870 corresponds unfortunately to a 
coefficient of alienation, k = .16, and therefore represents prediction 
still far from perfect, the errors of prediction being .16 as large as those 
of random guessing. The probable error of estimate for this group 
may be taken as 2.7 months of mental age whether the Herring is being 
estimated from the Stanford or vice versa. 

3. In unselected age groups the probable error of estimate of either 
the Stanford or the Herring Revision is likely, with competent testing, 
to be 2.7 months of mental age; of the Stanford plus the Herring 
Revision, the probable error of estimate would be about 2 months of 
mental age. 

4. It is useful to know what average difference corresponds to these
estimates. The average difference of H (Herring) and S (Stanford)
is a function of the constants of the correlation formula

$$ r = \frac{\sum HS}{N \sigma_H \sigma_S}. $$

For

$$ \sigma_{(H-S)} = \sqrt{\sigma_H^2 + \sigma_S^2 - 2 r \sigma_H \sigma_S}, $$

a formula already familiar (Kelley, “Statistical Method,” 1923) for
the standard error of a difference. The average differences of the
table were computed by means of the formula

$$ AD = .79788 \sqrt{\sigma_H^2 + \sigma_S^2 - 2 r \sigma_H \sigma_S}. $$



The average difference between the mental ages of one and the other 
revisions, with competent testing, is about three months. 
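
As a rough numerical illustration of the average-difference formula, the sketch below evaluates it in Python. The constant .79788 is the square root of 2/π, the ratio of the mean absolute deviation to the standard deviation in a normal distribution. The standard deviations are not stated in this passage; the value of 25 months used for both revisions is an assumption made only for illustration, chosen because it is roughly what a probable error of estimate of 2.7 months would imply at r = .987.

```python
import math

def average_difference(sd_h: float, sd_s: float, r: float) -> float:
    """Average (mean absolute) difference between two correlated measures,
    assuming the differences are normally distributed about zero."""
    sd_diff = math.sqrt(sd_h**2 + sd_s**2 - 2.0 * r * sd_h * sd_s)
    return math.sqrt(2.0 / math.pi) * sd_diff

# Hypothetical standard deviations (in months of mental age); not from the source.
ASSUMED_SD = 25.0
ad = average_difference(ASSUMED_SD, ASSUMED_SD, 0.987)
print(f"average difference: {ad:.1f} months")
```

Under these assumed values the result falls near the three months stated in the text; with other standard deviations it would scale in proportion.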

5. One learning to use either the Stanford or the Herring may feel 
satisfied of ordinary competency as a psychometrist when he can 

(a) Maintain a correlation of .97 between mental ages of the two 
in unselected age groups, or 

(b) Maintain an average difference between mental ages obtained 
by means of the two Binet Revisions of less than four mental months. 

Practice and study should, however, raise .97 to .98 and reduce 4 
to 3. Favorable circumstances and precise work will occasionally 
yield higher r and smaller AD. 

6. The existence of two independent Binet Revisions makes it 
possible in the measurement of intelligence in the elementary school to 
obtain unusually precise original measures instead of correcting for 
chance errors of measurement. This is true partly because of the light
thrown upon the reliability of each instrument and partly because of the 
provision now available for measuring individuals with two tests 
instead of one merely. When both revisions are used and the two 
resulting mental ages are averaged for each individual, the reliability 
correlation as between these two and two others may be prophesied 
as about .993 and the probable error of estimate about two months of 
mental age. The significance of the third decimal place in the case of 
certain high Pearson Product-Moment Coefficients of Correlation is 
indicated in the following table: 





   r      √(1−r²)         r      √(1−r²)

 1.000     0             .989     .148
  .999     .045          .988     .155
  .998     .063          .987     .161
  .997     .077          .986     .167
  .996     .089          .985     .173
  .995     .100          .984     .178
  .994     .110          .983     .184
  .993     .118          .982     .189
  .992     .126          .981     .194
  .991     .134          .980     .199
  .990     .141
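
The prophesied reliability of about .993 for the average of the two revisions is what the Spearman-Brown formula gives when a test correlating .987 with its alternate is doubled in length. The passage does not name the formula it used, so its use here is an assumption; the sketch below, with the illustrative function name `spearman_brown`, shows the arithmetic.

```python
import math

def spearman_brown(r: float, n: float = 2.0) -> float:
    """Predicted reliability when a test of reliability r is lengthened n-fold."""
    return n * r / (1.0 + (n - 1.0) * r)

# Correlation between single administrations of the two revisions.
r_single = 0.987
r_double = spearman_brown(r_single)          # average of two revisions
k_double = math.sqrt(1.0 - r_double ** 2)    # corresponding coefficient of alienation

print(round(r_double, 3))  # 0.993
print(f"alienation for the averaged measure: {k_double:.3f}")
```

The small gain in r from .987 to about .993 corresponds to a much larger proportional drop in the coefficient of alienation, which is the point of the table above.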


COMPARISON OF STANFORD AND HERRING BINET
REVISIONS GIVEN TO FIRST GRADE CHILDREN


GEORGE T. AVERY 
Associate Professor of Psychology, Colorado Agricultural College 


When the Herring Revision of the Binet Tests appeared in 1922 
it became desirable to compare it with the Stanford Revision. Inter- 
est had also been aroused by the high correlation figures (.97 to .99) 
published in the Herring manual. As such tests are of special value 
in the diagnosis of Grade I children and as other intelligence examina- 
tions, including the original Binet-Simon, have shown a weakness 
here, it was decided to give both the Stanford-Binet and the Herring- 
Binet Tests to the children of the Grade I rooms of the Homer school 
in Palo Alto, California. The cooperation of Superintendent A. C. 
Barker, and suggestions from Professor L. M. Terman and Dr. John P. 
Herring were much appreciated. The latter forwarded original data
of his unpublished thesis.

The Herring Test is a revision of the Binet-Simon, and is an
individual examination. As the author states, the questions are asked
and answered orally for the greater part. The examination contains
many tests similar in form to the Binet-Simon. The content, however,
is almost entirely original. It is given in the form of a point scale, the 
final score being expressed in a certain number of points which are 
referred to tables of mental age equivalents. Each test consists of a 
short series of elements and the examination score is the sum of the 
scores from the separate tests. The method also provides for the 
obtaining of mental ages by use of less than the 38 tests. A mental age 
may be calculated by administering the tests in five groups. One of 
these, Group A, is designed for rapid testing; Group B is designed for
use when more time is permitted; Group C is suggested as of high 
reliability. When very exact results are desired, Group D or Group E 
should be completed, according to the author. Tables are given show- 
ing the mental age equivalents of each total score. Provision is made 
for the calculation of the intelligence quotient in the same manner as 
by the Stanford-Binet.¹

The tests were given by two examiners experienced in Stanford-Binet
procedure, Mr. J. F. Walker and the writer, both connected
with the graduate school of Stanford. The group of children examined
was a part of the same school system used in the original standardization
of the Stanford-Binet, and is quite representative of an unselected
group. The Stanford was administered first, and the Herring about
two weeks later. Eventually the tests were scored by one person, the
writer.

¹ Further discussion of the Herring revision may be found in Herring, John P.:
“Herring Revision of Binet-Simon Tests. Form A.” World Book Co., 1922,
pp. 3-6. Also see article by Dr. Herring in this issue of this journal.

In this way, data for both tests were obtained from 48 children.
The IQ’s of the Stanford were then correlated with those of the Herring
for each of the Groups A, B, C, D, and E, by the Pearson Product-Moment
Formula, with the results as shown in Table I.


TABLE I.—CORRELATION OF STANFORD- AND HERRING-BINET TEST IQ’S

Stanford correlation with Herring Groups

           A         B         C         D         E
r  =     .725      .726      .796      .772      .769
PE =    ±.046     ±.046     ±.0356    ±.039     ±.0398


These correlations are, of course, fairly good, but they indicate 
that the Stanford and the Herring Tests are far from being equivalent. 
It is also interesting to note that Group C has a higher correlation with 
a lower probable error than any of the other groups, including D and E. 
Evidently the additional tests given in the latter two groups weaken 
the examination for these Grade I children. A correlation of the 
mental ages of the two tests was also made and is shown in Table II. 


TABLE II.—CORRELATION OF STANFORD- AND HERRING-BINET MENTAL AGES

Stanford correlation with Herring Groups

           A         C         E
r  =     .67       .824      .787
PE =    ±.053     ±.0312    ±.037


The results here corroborate those of the IQ’s, showing, as they do, 
only a little difference. 
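
The probable errors in Tables I and II are consistent with the conventional formula for the probable error of a product-moment coefficient, PE = .6745(1 − r²)/√N, with N = 48. The report does not state which formula was used, so this is an assumption; the sketch below, with the illustrative function name `probable_error_r`, lets the reader compare the computed values with the tabled ones.

```python
import math

def probable_error_r(r: float, n: int) -> float:
    """Conventional probable error of a Pearson r based on n cases."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

N_CHILDREN = 48  # number of children tested on both revisions

# Correlations transcribed from Table I (IQ's) and Table II (mental ages).
table_i = {"A": 0.725, "B": 0.726, "C": 0.796, "D": 0.772, "E": 0.769}
table_ii = {"A": 0.67, "C": 0.824, "E": 0.787}

for name, table in (("Table I", table_i), ("Table II", table_ii)):
    for group, r in table.items():
        pe = probable_error_r(r, N_CHILDREN)
        print(f"{name}, Group {group}: r = {r:.3f}, PE = {pe:.4f}")
```

Because the probable error shrinks as r rises, Group C necessarily shows both the highest correlation and the lowest probable error.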

It is also of value to observe how closely the various groups of the 
Herring Test correlate with each other. That such a relationship 
exists is indicated by Table I, as there is a variation of less than .08 
between them when correlated with the Stanford. This is also shown 
in Table III where Herring Group C, which correlated highest with the 
Stanford, is correlated in turn with Groups A and E. 


TABLE III.—IQ CORRELATION OF GROUP C WITH GROUPS A AND E, HERRING TESTS

Group C with Groups

           A         E
r  =     .884      .948
PE =    ±.0212    ±.0098

Group A, while a very good preliminary test, can hardly be considered
equivalent to Groups C or E with these children. Almost invariably 
Group A was too high when compared with the other Herring Forms. 

The Herring Tests tend to grade too high the child of a mental age 
of less than six. Out of a list of the nine lowest IQ’s according to the
Stanford-Binet, all of them less than 85, only three were rated as dull
by the Herring Group E.

There are a number of commendable qualities of the Herring Tests 
which should not be lost sight of. It is quite an undertaking to pro- 
duce a general intelligence test of almost entirely new material. The 
pictures, while not as well drawn as the Stanford, are of everyday 
home scenes which, with one exception (that of the office scene) are 
interesting to the child. This is particularly true of the “fishing
picture” and that of the “woman spilling the water.” The problem
situations and absurdities have less of the foreign touch than the earlier 
Binets. This is, of course, at the expense of standardization. 

The mechanical form, too, of the Herring Test is good. Almost 
all of the material is found in the examination booklet. The pictures 
and reading tests are inverted so that the examiner may sit at a table 
opposite the child and give him the test without turning the book 
about. His directions are printed so they will be upside down to the 
child. The suggestion of a celluloid cover to be placed over the pic- 
tures and diagrams is also a good one as it keeps them clean. 

Although the highest correlation in IQ of the 48 cases with the 
Stanford Tests was only .796, in 13 cases there was a variation of only 
two points, and in 22 cases there was a variation of only three points 
or less. This is suggestive. 

The length of time required to administer the Form A of the Herring 
Test to these children is from six to eight minutes. For Form E, which 
includes all of the intermediate forms, the time is only a little less than 
that required by the Stanford-Binet. 

Because of the limited number of cases it is difficult to dogmatize 
on the causes of variation. The following might be suggested: (a) 
This was a selected grade group and tests were restricted to Grade I; 
(b) There may be a variation in method of presentation; (c) The lower 
end of the scale grades too high; (d) The Herring examination is 
perhaps insufficiently standardized. Based as it is upon less than 200 
cases, the conclusions must be tentative. It is also questionable to 
rest the validity of a test on the correlation alone, or to overweight 
such a comparison.

The following observations seem applicable to the children studied.
The test, Group A, was apparently too brief. With only two of the 
tests very practicable for Grade I children, this group gives an insuffi- 
cient survey. The lowest rating given in Group A mental equivalents, 
was five points, with an equivalent in mental age of 74 months. If a
child could enumerate three, and in some cases two objects in each of 
the four pictures, and could describe one very briefly, he could obtain 
a score of five points, giving him a mental age of 74 months. Every 
child, with one exception, was able to do this, although from Stanford 
returns as well as from teachers’ reports there were several who were 
manifestly retarded mentally. The enumeration test in the Stanford 
Examination, it may be recalled, is placed in the three-year-old group. 
The experience with the Herring was that children with a mentality of 
three to four made too high a score on Group A. 

If the tests of Group A are correctly standardized, the following 
tests are too easy and therefore unnecessary since they were passed by 
every six-year-old child or one with a mental age of six by the Stanford
Examination: Test 5, pointing to parts of body; Test 6, repetition of
six syllables; Test 7, size comparison; Test 8, “which is prettier?”; Test
31, naming objects; Test 32, geometric form comparison.

An interesting observation was noted in the test in which the child 
is asked to name the three shades; black, gray, and white. In several 
instances the gray was called brown. 

In general intelligence tests such as this, the standardized answers 
should not be too abbreviated as it makes too great a variation between 
the work of the individual examiners. The Herring answers, also, 
need further standardization. 

A difficulty arises in giving a long series of tests which are too
hard for the child, as he becomes discouraged. There is a loss of
pleasure and interest. At the head of each group of tests there is a 
paragraph permitting the examiner to omit certain tests if the score in 
Group A is less than a certain number of points. The number of tests 
so omitted ought to be increased for Grade I children, as practically 
none of them can read and can not, therefore, respond to such tests as 
3, 11, 18, 20, 21, 22, 28, 29, 30, 37, and 38. 

Certain types of tests need balance. The repetition of such 
sentences as the following: “In winter boys and girls like to make
snowballs” is much harder for children who live in such places as
California and have never seen snow or ice. Two such sentences
were found in Test 23. The same answer, “round,” in the Similarities
Tests was found for the first question in Test 18 and in Test 26. In
Test 15 there were too many questions involving money transactions.
The “applicant picture” was more difficult than the other members of
the group. The vocabulary examinations were beyond the pupils of 
Grade I. A good vocabulary test has proved very reliable in the past. 
The deductive form of proverbs is much harder and less stimulating 
for children than the inductive form. This affects Test 19. In the 
Maze Test, No. 34, describing the News Route, the instructions are 
too involved and long, a frequent fault in Maze Tests. The Verbal 
Tests seem to be of greater interest to the boys than to the girls. 

This report may be summed up by saying that for these Grade I 
children: (1) The highest correlation between the Stanford IQ’s and
those of any group of the Herring is that for Group C, r = .796; between
mental ages of the Stanford and the Herring Group C, r = .824; (2) The chief difficulty with
the Herring Tests is that they fail to evaluate properly the child of a 
mental age of less than six years; (3) The Herring Tests show a number 
of excellent qualities, but need further standardization. 








WHOLE AND PART METHODS IN LEARNING 
WARNER BROWN 
Psychological Laboratory, University of California 


In a recent number of this Journal H. B. Reed presents figures
and arguments tending to break down the widely accepted opinion
of psychologists that the whole method is superior to part methods of
learning. He contends (p. 115) that “the most economical method
of learning is in the last analysis a question for the individual learner.
Most individuals find the part method superior.” At another point
he says (p. 113), “the general superiority of the part method may be
due to the former study habits of the students, or it may be due to
some inherent advantage which the part method has in the learning
process of the individual. It would be difficult to find out which of
these statements is true.” But a little later (p. 114) he says, “But
there must be some inherent advantage in the part method.
The only way in which the mind learns is by parts. . . . Since
this is the way the mind actually learns, why not follow some form of
the part method and adjust the method to the need of the mind?”
This argument is supported by data selected from Ebert and Meumann,
together with the repetition of their experiment by Reed
and by a table of selected findings from Pentschew. The painstaking
investigation of Steffens, giving 30 experimental series many of which
represent 30 or more sittings, is not quoted and is rather slightly
referred to as “only one experiment.” The work of Pechstein,
which is certainly very significant in the analysis of learning, but which
has a very remote bearing upon the learning of verbal material, is used
to offset the clear-cut findings of Pyle and Snyder, and of Lakeman.
Larguier des Bancels and Neumann are not even mentioned.

If, as Reed says, “the evidence in favor of each method is impartially
weighed, as every scientific writer should do,” I believe we will find that
the part method is inferior in every well-controlled experiment save
Pechstein’s and part of Ephrussi’s (pp. 86-89). The point about
the whole method has always been that it possesses theoretical and
experimentally demonstrable advantages over more “natural” methods
and that children should be taught to use it in place of their natural
method. No one has ever claimed that untrained learners adopt
the whole method spontaneously, or even willingly. One of the chief
points made by Steffens when in 1900 she published her original paper
on the economy of learning was that untrained learners who are left
to their own devices choose uneconomical methods of learning. It
was stated repeatedly by Pentschew (pp. 492, 496) in 1901 that the
whole method requires more effort and that if learners are permitted
to choose their own rate, they will go slower on the whole method.

It is Reed’s contention that the whole method can be evaluated
on the basis of time alone, and that measures of number of repetitions
or of amount of retention are not necessary. Here he allies himself
with Steffens as against Pentschew and Meumann, who insisted that 
Steffens’ work was inadequate in so far as it did not measure the 
last named factors. Yet Reed does not accept Steffens’ data. Reed’s 
repetition of the work of Ebert and Meumann gives only measures of 
time. Since Ebert and Meumann gave their presentations at a uni- 
form rate of speed it would appear that the poorer score for the part 
method must be equivalent to a poor score for time. Reed does not 
make this point clear. The disagreement between Reed and Meumann 
has two roots: (1) Reed says of Meumann’s V-methods in which the 
material is learned through and through but with one or more pauses 
during the reading (these appear in Reed’s table, p. 108, as D and C) 
that they are “forms of the part method.” (2) Reed now ignores
Meumann’s (pp. 196-199) careful discussion of retention in which
he shows experimentally that the advantage of the mixed method 
for learning is offset by superior retention under the strict whole 
method. But, even as presented by Reed, the Ebert and Meumann 
data show that the strict part method is not the best for learning and 
is the worst for relearning. 

Pentschew found that when the rate of learning was not controlled 
the whole method “was not always the quickest” (p. 471). But 
Reed states categorically after giving figures ascribed to Pentschew,
“it will be seen that as measured by time both children and adults
find the part method more economical.” The figures, which are said
to be “a sample,” do not seem to me to be fairly representative of
Pentschew’s findings and certainly do not agree with Pentschew’s own
statement: “Führt das Lernen ‘im Ganzen’ auch in kürzerer Zeit
zum Ziele” (“Learning ‘as a whole’ also leads to the goal in a shorter
time”). When I count each experimental series in Pentschew’s
report as a single unit I find an economy of time is shown in 10 series 
and that 4 of these are favorable to the whole method. Economy in 
repetitions is shown in 24 series of which 18 are favorable to the whole 
method. 

The new data presented by Reed fall into three groups: (1)
Individual studies of four persons. (2) Inconclusive returns from
an essay examination given after instructions to students to study
by paragraphs or to study by chapter. (3) Experiments upon 113
students learning 12 stanzas of poetry, 4 by the part method, 4 by the
“part progressive method,” and 4 by the whole method. The score
is given in average time consumed without any measure of the
variabilities. The score given for the whole method is 5.95 minutes; for
the part method, 5.45 minutes, leaving an advantage for the part
method of ½ minute. Whether this is a reliable difference is left
unsaid.

The experiments of both the second and third groups contain the 
error of leaving the rate of reading uncontrolled, and have the further 
defect that no account is taken of the number of repetitions. 

Reed states that “there is only one theoretical objection to the
part method and that is that it introduces some conflicting
associations.” He ignores the advantage of leaving an interval following
the repetition of any given element. If 12 items are learned, one 
item at a time, no interval follows any repetition of an item, whereas 
if they are learned as a whole, a substantial interval follows the repeti- 
tion of each item before its recurrence. If Reed is right in holding that 
the only theoretical objection to the part method is on the ground 
of association it is to be expected that vocabularies, into which consecu- 
tive associations do not appear to enter, will certainly not be learned 
economically by the whole method. In fact, Ephrussi, whose results
are otherwise favorable to the whole method, found that vocabularies
do not lend themselves to whole learning, although it may be noted
that she tested retention of vocabularies by means of backward
association, i.e., the German word was learned at the left and first
but the Russian word was given in the association test. But in 1906,
G. Neumann published a dissertation from Göttingen in which he
showed a marked advantage for the whole method over a part or mixed
method with children learning foreign vocabularies.

I have attempted recently to reproduce Neumann’s experiment. 
I gave vocabularies of 12 words, each English word paired with a two- 
syllable nonsense word. For the part method, an English word was 
read aloud to a class (college students, both sexes), and after it the 
corresponding nonsense word. During this time the two words were 
shown on a card. This was repeated 12 times taking 30 seconds. 
Then the next pair of words was given 12 times for 30 seconds, and 
so on until all 12 pairs had been given; and then the whole list, each 
English word with its mate, was read over once. For the whole
method the entire list, each English word with its mate, was read
over from beginning to end 12 times taking in all six minutes, the
same amount of time used for the part method, except that the
part method had one extra “whole” repetition not counted in the six
minutes. During the entire six minutes a card was exposed upon 
which were the 12 pairs of words. 

As soon as one list, either whole or part, had been finished, a test 
was made by giving an English word and allowing 10 seconds for the 
writing of it with its nonsense mate. The 12 English words were 
given in an arbitrary order different from the order of presentation. 
Two lists of words were used, which can be called List A and List B. 
It turned out that they were not equally difficult. In some classes List 
A was used for the part method, and in some List B. In some classes 
the part method was given first, and after the test, the whole method. 
In other cases, the whole method was given first and the part method 
second. Thus the data fell into four compartments, as indicated in 
the accompanying table. It will be observed that the same subjects
appear in two compartments but that the comparison is always made 
between independent groups, that is, no subject is ever compared 
with himself. The figures indicate the mean number of words written 
perfectly, distributed according to part and whole method into the 
four compartments just mentioned. It is easily seen, of course, that 
the other groups are not strictly comparable because of the difference 
in difficulty of the lists and because of the effect of the order of presenta- 
tion. Under these four conditions it appears that the whole method 
gives a better recall than the part method for each of the groups 
compared. In three of the four cases the difference in favor of the 
whole method is three or more times its probable error. In only one 
of the four cases is the difference small enough to be suspected of 
unreliability. 











Number     Number                                 Mean number    Mean number    Difference     PE of
cases for  cases for                              words          words          in favor of    difference
whole      part        Condition                  recalled,      recalled,      whole
method     method                                 whole method   part method    method

  166         83       Part given first, list A       1.92           1.4            .52           .16
   83        166       Part given first, list B       3.89           3.19           .70           .20
  142        124       Whole given first, list A      1.64           1.18           .46           .13
  124        142       Whole given first, list B      3.34           3.14           .20           .17
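
The statement that three of the four differences are three or more times their probable errors can be verified from the table. The sketch below is only an arithmetic check; the names `ROWS` and `difference_ratios` are chosen here for illustration.

```python
# Rows transcribed from the table:
# (condition, mean recalled by whole method, mean recalled by part method, PE of difference)
ROWS = [
    ("Part given first, list A", 1.92, 1.40, 0.16),
    ("Part given first, list B", 3.89, 3.19, 0.20),
    ("Whole given first, list A", 1.64, 1.18, 0.13),
    ("Whole given first, list B", 3.34, 3.14, 0.17),
]

def difference_ratios(rows):
    """Return (condition, difference, difference / PE) for each row."""
    results = []
    for condition, whole, part, pe in rows:
        diff = whole - part
        results.append((condition, diff, diff / pe))
    return results

for condition, diff, ratio in difference_ratios(ROWS):
    print(f"{condition}: difference {diff:.2f}, ratio to PE {ratio:.2f}")
```

Only the last comparison (whole given first, list B) falls below a ratio of three.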































The data of this experiment seem to confirm Neumann’s finding
that even when the material consists of unassociated elements, with a
test for the elements in a deranged order, there is still an advantage
in the whole method, if, as should be provided in any well-regulated
experiment, the rate of presentation and the number of presentations
are kept constant.


I append the list of contributors to this discussion upon whose
work educational authorities have based their conclusions. Of the
10, Reed and Pechstein are the only ones who themselves dispute the
advantage of the whole method or whose data are unfavorable to the
whole method.


SEX DIFFERENCES IN GEOMETRIC ABILITIES
FRANK C. TOUTON 
Associate Professor of Education, University of Southern California 


Scope of Study.—The report here made is based upon a critical 
study of the preferences expressed for certain types of geometric origin- 
als by 2800 New York high school pupils, and upon the achievements 
of these pupils in solving the originals selected. The exercises con- 
sidered were included in the June 1918 Regents’ Examination list, 
from which list pupils were required to select and solve at least four 
exercises and to prove four theorems, or they might select as many as 
eight exercises and prove no theorems. 

It should be noted here that it is required of high school pupils in 
the State of New York to take an examination (Regents’ Examination) 
set by a member of the staff of the State Department of Education 
upon the completion of each subject in the school curriculum. The 
2800 papers used in this study were selected from those submitted 
by approximately one hundred high schools. This selection of papers 
for the study was made from the papers submitted by the several 
schools in such a manner as to give a random sampling of the work 
done in the schools of the state, and to secure, if possible, approximately 
five hundred solutions written by each sex on each exercise. 

Since it is not the practice of New York State high school teachers 
to send to the Department of Education at Albany such papers as are 
not, in the judgment of the teacher, of sufficient worth to secure a 
passing grade on the examination as a whole, the papers examined 
probably contained work of a higher average grade than would have 
been found had all papers been studied which were written by the 
pupils in the geometry classes of these schools. Again, since the 
pupils were allowed to make certain selections from among the exercises 
of the list and since the quality of the work done on each exercise is 
considered by itself in the study, it is believed that the study affords 
a better than average sampling of the achievements of pupils in 
solving originals in geometry.

Through reference to the distributions of Table I, it will be observed 
that in the case of several of the exercises there was a high per cent of 
perfect scores. That fact makes it impossible to measure adequately 
here the variations in ability which would normally appear in the 
upper ranges of the distributions. Such differences are concealed by 
the piling up of perfect scores, which occurs in the case of several of
the exercises. We are, however, in position to contrast fully sex
differences in interests for the upper portions of the groups, and to
contrast the variabilities of the sex groups in achievements insofar
as they appear in the lower two-thirds of each of the distributions, and
in the entire range for three of the distributions. The large number
of cases used (much larger than in most studies reported) gives
reliability to the conclusions as to sex differences of groups selected by
high school entrance and by passing the Regents’ Examination in
Plane Geometry in the mental trait here called “ability to solve a
geometric original.”

The examination as set in June, 1918 consisted of four theorems and 
nine original exercises. Since the work done by pupils in solving 
original exercises is considered by teachers of geometry to be a better 
measure of achievement in geometry than the work done in reproducing 
proofs of theorems commonly given in texts, it was decided to consider 
in this study only that work done by pupils in their attempts to solve 
original exercises. 

Exercises Considered.—The list of exercises is here given. From
this list the pupils might select for solution as few as four, or as many
as eight exercises.


Exercises 5-13 Inclusive, Taken from the New York Regents Examination in Plane 
Geometry of June 20th, 1918 


(The first four exercises were theorems to be proved and are not listed below.) 


5. The area of a square inscribed in a circle is 16 square inches. Find the area 
of the circle. 

6. The radius of a circle is 15 inches. Through a point 5 inches from the center
a chord is drawn. What is the product of the two segments of the chord? What
is the length of the shortest chord that can be drawn through that point? 

7. The base of a triangle is 20 feet; the other sides are 10 and 16 feet. A line 
parallel to the base cuts off 2 feet from the lower end of the shorter side. Find the 
segments of the other side and the length of the parallel. 

8. Construct a line tangent to a given circle and parallel to a given line outside 
the circle (to receive credit construction lines must be shown). 

9. (a) An exterior angle of a regular polygon equals of a right angle. Find 
the number of sides of the polygon. 

(b) The difference between two angles inscribed in the same circle is 20°. 
What is the difference between two central angles subtended by the arcs of the 
inscribed angles? 

(c) If the side of one equilateral triangle equals the altitude of another, what is 
the ratio of their areas? 


(d) The sides of a triangle are 3 feet and 8 feet. What are the numerical limits
of the third side?

10. ABC is a triangle. D is the foot of the perpendicular from A on BC, P
is the middle point of BC, X is a point on BC such that XP = PD. If the line
through P perpendicular to BC meets AX at M, prove that MB = MC and MX =
MA.

11. From the external point P a secant PM is drawn to the circle so that 
it is bisected by the circle at N. MD is the diameter through M. Prove that 
PD is equal to the diameter and state a simple method for drawing a secant 
from an external point to the circle so that it will be bisected by the circle. 

12. Prove that a line perpendicular to the side of a right triangle at its mid-point 
passes through the mid-point of the hypotenuse. 

13. ABC is a triangle and AD, BE, and CF are its medians. DH is drawn equal 
and parallel to BE and cuts AC. Prove HA is equal and parallel to CF.


Items Reported Upon.—It is my purpose here to report on the 
following items: 

1. The differing strengths in the appeals made to boys and girls 
by the several exercises of the list from which they were required to 
make their selections.

2. The achievements of the sex groups expressed in terms of the per- 
centage of one group equalling or excelling the median of the other group. 

3. Sex differences as shown in achievements in solving exercises 
differing in type. 

4. The correlation of the sex groups in geometric abilities. 

5. Certain difficulties experienced by girls in greater degree than 
by boys in solving the construction exercise. 

6. Variation of the sex groups on each exercise and in the complex 
trait, “ability to solve geometric originals.” 

Tabulated Data.—The distributions in Table I, which follows, show 
the successes attained by the pupils in their work on each exercise. 
The scores used were those either made by or approved by the 
examiners on the staff of the State Department of Education at Albany, 
New York. These scores were assigned to the several exercises and 
not to the examination as a whole. They afforded the best measures 
of pupil geometric ability which were available. 

The correlations between sex and preference found in the choice of 
certain exercises, and between sex and success (or failure) in solving the 
exercises selected, are set down in columns F and K of Table II. That 
table also gives in column H the measures of average ability, and in 
column I the measures of variability for the sex groups on each of the 
exercises. 

Probability of a Sex Difference in the Choice of the Geometric Exercises. 

Through the use of a method given by Pearson and of tables calculated 
by Elderton (found in “Tables for Statisticians and Biometricians” 
by Pearson), we are able to measure the correlation (probability 
of association) between being a boy rather than a girl and selecting 
an exercise for solution rather than rejecting it. 

In this process, use is made of items reported in columns B, C, 
and D of Table II, which are based upon the observed data for selec- 
tions and rejections of each exercise. Here the observed data are 
compared with theoretical data, or the data which would have resulted 
had the factor of chance only, and not the differing appeals or sex 
preference, operated in the making of selections of exercises for solution. 

The process of computing the correlation coefficient (the Tetrachoric 
r) is not here given. Anyone interested in the use of this correlation 
coefficient can examine the method given in the tables referred 
to above. This coefficient is itself one root of a sixth degree equation. 
These r’s, with the probable error for each, are given in column F of 
Table II. 

An examination of these coefficients of correlation (Tetrachoric 
r’s) in column F of Table II and of the items in column E in Table II, 
shows a zero correlation or pure chance (1:1) relationship between 
being a boy rather than a girl, and selecting Exercise 5 rather than 
rejecting it. This result may be in part due to the fact that Exercise 
5 is briefly stated, is apparently easy to solve, and comes first in the 
list from which selections are to be made. Under the situation pre- 
sented, this exercise appeals as strongly to boys as to girls. 

The correlations in the case of Exercises 5, 6, 7, 9, 10, and 11 are 
so low that when taken in connection with their probable errors, it 
is evident that no significance can be attached to them. The negative 
correlation (−0.13 ± 0.02) in the case of Exercise 13 indicates that 
boys either were not appealed to by the statement of the exercise, or 
that they were more generally satisfied by the other options in the list, 
or that they used more discrimination in avoiding a very difficult 
exercise. 

It is possible that the low but significant and reliable correlation 
(r = 0.17 ± 0.02) in the case of Exercise 12 indicates that the right 
triangle situation there presented makes a stronger appeal to the boy 
than to the girl. That a real difference in appeal exists in the case of 
this exercise appears from column E where, from item (1:6540), it is 
seen that the observed difference is not a chance difference; for a 
difference between observed selections and chance selections so great 
as was here found could occur as a matter of chance only once in 
6540 trials. 
The significant and reliable correlation (r = 0.31 ± 0.02) in the 
case of Exercise 8 indicates clearly that the construction exercise there 
presented makes a decidedly stronger appeal to boys than to girls. 
That such a difference in preference as is expressed by the sex groups 
for this exercise cannot be accounted for on a pure chance basis is 
evident from the item (1:559,000) in column E of Table II, which 
states the fact that such a difference between observed and chance 
selections as was found could not occur as a matter of chance oftener 
than once in 559,000 times. 

That there is a sex difference in the preference for such a construc- 
tion exercise when various options are offered is, then, practically 
certain. It is, doubtless, to be accounted for on the basis of the known 
strength of the manipulation-experimentation interest in boys. Other 
psychological data are usually interpreted to show that boys are by 
original nature interested in things and their mechanism, while girls 
are chiefly interested in people and what they do. 

Sex Differences Expressed in Terms of the Percentage of Boys Who 
Equal or Excel the Median Girl.—The items given in column J of 
Table II indicate for each exercise the percentage of boys who equal 
or excel the median girl. Since in each of the nine cases more than 50 
per cent of the boys excel the median girl, and since the average of 
these percentages is about 61 per cent, we can say that in the complex 
mental trait called ‘ability to solve geometric originals’ the median 
boy does excel the median girl. 
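
The measure reported in column J can be sketched in a few lines of Python. The scores below are hypothetical, invented only to illustrate the computation; they are not the Regents' data, and the function name is an assumption of this sketch.

```python
from statistics import median


def pct_equal_or_excel_median(group, reference):
    """Percentage of `group` scoring at or above the median of `reference`."""
    ref_median = median(reference)
    at_or_above = sum(1 for score in group if score >= ref_median)
    return 100.0 * at_or_above / len(group)


# Hypothetical scores on one exercise (not taken from Table I).
boys = [10, 8, 7, 9, 6, 10, 5, 8]
girls = [7, 6, 9, 5, 8, 4, 10, 6]

pct = pct_equal_or_excel_median(boys, girls)
print(f"{pct:.1f} per cent of the boys equal or excel the median girl")
```

A value of 50 per cent would mean complete overlap of the two medians; the further the figure departs from 50, the greater the difference between the groups.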

Miss Rusk, using high school grades, states that in geometry 53 
per cent of the boys reach or excel the median ability of the girl.¹ 

Thorndike, using Regents’ Examinations and school marks, states 
that in mathematics 57 per cent of the boys reach or excel the median 
ability of the girls.² 

The difference here shown in geometric ability, 61 per cent of the 
boys excelling the median girl, is somewhat higher than has been found 
by other investigators, yet here, as in other studies, the most charac- 
teristic fact about this observed difference is its small amount. The 
amount of overlapping in the case of the distribution of Exercise 
11, where 60 per cent of one group equal or excel the median of 
the other, appears in the overlapping areas of the following frequency 
polygons. 


1 Thorndike: Educational Psychology, Vol. III, p. 184. 
2 Ibid, p. 183. 
While the data on geometric abilities reveal a slight superiority of 
the male sex (61 per cent of the boys excel the median girl), yet the 
wide range of individual differences in each sex group (see the fre- 
quency distributions for each exercise and for each sex in Table I) far 
outweighs in importance the small difference between the medians of 
the groups, as has been previously observed elsewhere. 

‘So far as ability goes, there could hardly be a stupider way to get 
two groups alike within each group but differing between groups than 
to take the two sexes.’¹ 

Sex Differences in Ability to Solve Types of Exercises.—Reference 
to column A of Table II or to the question list of page 235 brings out the 
following facts: 

(a) Exercises 5, 6, 7, and 9 were of a quantitative type requiring 
the use either of arithmetic or algebraic processes in their solutions. 

(b) Exercise 8 is an exercise requiring knowledge of methods of 
construction. 

(c) Exercises 10, 11, 12, and 13 require formal proof and syllogistic 
organization. 


By comparing the percentage of boys who equal or excel the ability 
of the median girl for each of the three types, we get a measure of the 
adaptation of each type of exercise to the sex groups. 

Considering columns A and J, we note that: 

(a) The percentages of boys excelling the median girl range in 
numerical exercises from 57 per cent to 64 per cent with an average of 
61.2 per cent. 

(b) The percentage of boys equalling or excelling the median girl 
in the construction exercise is 66 per cent. 

(c) The range in the exercises requiring proof is from 57 per cent 
to 67 per cent with an average of 60.7 per cent. 

It follows from the above data that in solving numerical exercises 
boys show practically the same amount of superiority over girls that 
they show in solving exercises involving proof and syllogistic organiza- 
tion. Again, the greater superiority of boys (66 per cent of the boys 
surpass the median girl) in the construction exercise is not particularly 
significant. It is evident, then, that to differentiate among the types 
of geometric originals (numerical, construction, or proof) to be given 
to the sex groups would not be supported by data on differences in 
geometric abilities here observed. 


1 Thorndike: Educational Psychology, Vol. III, p. 184. 
Abilities of Boys and Girls in Solving Geometric Originals Expressed 
through the Use of a Coefficient of Correlation—the Bi-serial r.¹—The 
items in columns G and H of Table II indicate that, in the case of each 
exercise, a higher percentage of boys than of girls made perfect scores 
on the exercise selected. It is also seen that the mean score of the 
boys is higher than the mean score of the girls on each exercise. 

The coefficients in column K of Table II show that the correlation 
between being a boy rather than a girl and excelling at least one-half 
of the total group rather than being surpassed by at least one-half 
of the total group is positive in the case of each exercise, and that, in all 
save the first exercise, the coefficients have considerable reliability. 
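
A minimal sketch of the usual bi-serial formula follows. It is built from the four ingredients named in the footnote to this section, with the table look-up replaced by the normal-curve ordinate from Python's standard library. The function name and the scores are hypothetical, and the sketch is not offered as the exact procedure followed in preparing Table II.

```python
from statistics import NormalDist, fmean, pstdev


def biserial_r(larger_group, smaller_group):
    """Bi-serial r between group membership and score (standard formula)."""
    combined = list(larger_group) + list(smaller_group)
    p = len(larger_group) / len(combined)  # (d) share of the larger series
    mean_larger = fmean(larger_group)      # (a) mean of the larger series
    mean_all = fmean(combined)             # (b) mean of the combined series
    sd_all = pstdev(combined)              # (c) S.D. of the combined series
    std_normal = NormalDist()
    cut = std_normal.inv_cdf(p)            # point dividing the curve into p and 1 - p
    ordinate = std_normal.pdf(cut)         # height of the normal curve at that point
    return (mean_larger - mean_all) / sd_all * p / ordinate


# Hypothetical scores, for illustration only.
group_a = [8, 9, 10, 7, 9, 10]
group_b = [5, 6, 7, 8]
r_bis = biserial_r(group_a, group_b)
print(f"bi-serial r = {r_bis:.2f}")
```

A positive value means that the larger series has the higher mean; the coefficient assumes that a normally distributed trait underlies the two-fold division.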

In contrasting the items of columns E and K of Table II, it will 
be observed that, in general, along with the strength of the preference 
shown by the boys goes corresponding success in excelling more than 
one-half of the total group in solving the exercises selected. This 
fact is particularly outstanding in the case of the construction exercise. 

Having determined the type of exercise (the construction exercise) 
in which the sex preference appears greatest, and in which the correla- 
tion with success is highest, we shall now utilize data gained from the 
analysis of work done by pupils in solving the exercise to determine 
the particular mental traits which may account for the observed sex 
difference. 

Differential Difficulties Experienced by Boys and Girls in Carrying 
Out a Construction Exercise.—As has been previously stated the highest 
correlation [r = 0.31 ± 0.04 in Exercise 8 of column F] between being 
a boy rather than a girl and selecting a certain exercise rather than 
rejecting it, was found in the preference shown by boys in their selection 
of the construction exercise. Likewise, in this construction exercise 
the highest correlation was found (r = 0.38 ± 0.02 in Exercise 8 of 
column K) between being a boy rather than a girl and excelling the 


1 The Bi-serial r was first used by Pearson in “Biometrika,” Vol. VII, though it 
received even later the designation which it now retains. It expresses, as here 
used with achievement scores, the relationship between two series of scores through 
the use of (a) the mean of the distribution having the larger number of measures, 
(b) the mean of the two distributions combined as one, (c) the standard deviation 
of the combined distribution, and (d) the percentage which the larger series of 
measures is of all measures combined. These data and the use of a table by 
Sheppard in Pearson’s “Tables for Statisticians and Biometricians” make possible 
the computation of the coefficients showing the relationship between being a boy, 
rather than a girl, and doing better or worse in the solution of each of the original 
exercises in the list. 
median of the total group in the solution of the exercise. Again in this 
exercise the ratio of the variability of boys to the variability of girls 
(0.67 in column M of Table II) was lowest. The sexes seem to vary 
most in doing that exercise on which the strongest preference is expressed. 

The detailed data on the analysis of the solutions of this exercise 
(not here quoted), show that of the 501 boys and 579 girls who 
attempted the solution, 455 or 91 per cent of the boys and 466 or 80 per 
cent of the girls noted that by the conditions of the problem both the 
circle and the line outside of it are given and fixed in position. Appar- 
ently a higher percentage of girls did not note that by the conditions 
of the problem the circle and the line outside of it were given and fixed 
in position. Eleven per cent more girls than boys failed to comprehend 
the fact that the given elements of this problem have fixed positions. 

From the data quoted above, we are able to express, by means 
of Pearson’s square contingency coefficient, the correlation between 
being a boy rather than a girl and noting the fact that the circle and 
outside line are fixed by the conditions of the problem. The method is 
here shown: 


























Interpretation of given elements     Boys         Girls        Total
Correct ..........................   455 (419)    446 (482)     901
Incorrect ........................    48 (84)     132 (96)      180
Total ............................   503          578          1081

(Expected chance frequencies are shown in parentheses.)

\[ \chi^2 = \frac{(36)^2}{419} + \frac{(-36)^2}{482} + \frac{(-36)^2}{84} + \frac{(36)^2}{96} = 34.78 \]

\[ \text{Coefficient of square contingency} = \sqrt{\frac{34.78}{1115.8}} = +0.18 \]

It should be noted that the square contingency coefficient 
here used could not have exceeded 0.707, since it is derived from a 
fourfold table. 
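
The computation can be checked with a short Python sketch that works from the observed counts of the fourfold table (455, 446, 48, 132). The function name is an assumption of this sketch. Because the expected frequencies are not rounded to whole numbers here, the chi-square comes out a little below the printed 34.78, while the coefficient still rounds to 0.18.

```python
from math import sqrt


def contingency_coefficient(table):
    """Return (chi_square, C) for a table of observed counts given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if sex and interpretation were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    # Pearson's coefficient of square contingency.
    return chi_square, sqrt(chi_square / (n + chi_square))


# Rows: correct, incorrect interpretation; columns: boys, girls.
observed_counts = [[455, 446], [48, 132]]
chi_square, c = contingency_coefficient(observed_counts)
print(f"chi-square = {chi_square:.2f}, C = {c:.2f}")
```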

It follows from the above that an association between being a boy 
rather than a girl, and excelling in this mental trait is probable; that 
is, it is probable that girls as a group require more training than do 
boys in determining the fixed relations and conditions of a construction 
problem. 

Moreover, in these construction solutions, the data show that 
85 per cent of the boys and only 64 per cent of the girls use the short 
method of determining the point of contact of the required tangent. 
In this regard the square contingency coefficient shows a probability 
of association of 0.24. Here, too, there seems to be a real sex difference. 

In carrying out the instruction to “show clearly all construction 
lines” (see text of Exercise 8) the contingency or probability of associa- 
tion between being a boy rather than a girl and following the instruc- 
tion is 0.14, a significant though not high degree of relationship. 

From the foregoing facts it seems clear that in teaching construction 
exercises, more care is required with girls than with boys in distinguish- 
ing between the given and required elements of the problem. 

An adequate explanation why sex differences take the form shown 
is not here attempted. It must suffice to establish a method of pro- 
cedure in locating these differences, and to supply some data to show 
sex differences in fairly elementary processes. 

Variabilities of the Sex Groups in Achievement.—In terms of a 
standard deviation (an amount which must be added to and subtracted 
from the central tendency to get the range of scores which will include 
68.32 per cent of cases normally distributed) it is possible to express 
roughly the variability of the scores in a series. By reference to data 
in columns H and I of Table II, it is not difficult to see that in only 
three out of nine exercises are the achievements of boys more variable 
than those of girls. Now the standard deviation is a fairly good index 
of variability, but in contrasting groups the coefficients of variability 
(standard deviation divided by the mean of the series) give a more 
reliable measure of variability. The items in column L of Table II 
indicate that in each exercise save the last boys are less variable than 
girls. The result in the last exercise may be due (see the distributions 
of Table I) to the very low mean score of the girls, in fact only 4 per 
cent of the girls attempting the exercise succeeded in getting a score 
which indicated that their solution was more than half correct, while 
22 per cent of the boys who selected the exercise succeeded in getting 
the half correct score or better. Again the percentage of boys making 
a perfect score on this exercise (see column G of Table II) was 744 
times as great as that of girls. Now since the mean score for boys 
was higher than for girls, and since a higher percentage of boys succeeded 
in really solving the problem, it is obvious that their scores 
might easily vary more. 

The ratios in column M of Table II are quotients of the variability 
coefficients; that is, boy coefficient of variability divided by girl coefficient 
of variability. The coefficients themselves were given in column L. They indicate 
the fact that in all of the exercises save the last, a situation accounted 
for above, boys of the selected group under consideration are less 
variable in the mental trait called “ability to solve geometric originals” 
than are girls. 
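
The two quantities of columns L and M can be sketched as follows: the coefficient of variability is the standard deviation divided by the mean, and the ratio is the boys' coefficient divided by the girls'. The scores are hypothetical and the function name is an assumption of this sketch.

```python
from statistics import fmean, pstdev


def coefficient_of_variability(scores):
    """Standard deviation of the series divided by its mean."""
    return pstdev(scores) / fmean(scores)


# Hypothetical scores on one exercise (not taken from Table I).
boys = [6, 8, 10, 8, 8]
girls = [2, 6, 10, 4, 8]

cv_boys = coefficient_of_variability(boys)
cv_girls = coefficient_of_variability(girls)
ratio = cv_boys / cv_girls  # below 1 when boys are the less variable group
print(f"boys {cv_boys:.3f}, girls {cv_girls:.3f}, ratio {ratio:.2f}")
```

Dividing by the mean keeps a group with a higher average score from appearing more variable merely because its scores are larger.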

It may be here repeated for emphasis that the lowest ratio of the 
variability of boys to the variability of girls is found in that exercise 
(the construction exercise) in which the most marked preference was 
shown. In the particular trait in question, ability in solving a con- 
struction exercise, boys and girls differ more than in the solution of 
either of the other two types of exercise. 

The ratios of boy to girl variability here found [see items of column 
M], showing boys in general less variable than girls, do not support 
the findings of previous investigators. Since their investigations were 
limited as to number of cases, and since this investigation is extensive 
(about 500 of each sex for each exercise), there seems to be evidence 
pointing clearly to the fact that in ability to solve geometric originals 
of the quantitative, construction, and proof types, boys are less variable 
than girls when the groups under consideration show truncated dis- 
tributions, having been cut off at the lower end of the curve by the 
fact that a selected group enters the high school geometry class and 
the fact that only the better papers of each sex group are here con- 
sidered, and cut off at the other end of the curve by the fact that in 
several of the exercises there are a large number of undistributed 
perfect scores. 

Summary of Findings.—The report here made on the work done by 
boys and girls in a New York State Regents’ Examination in Plane 
Geometry, brings out the following facts for the selected groups under 
consideration: 

1. Boys show a decidedly stronger preference than do the girls in 
selecting for solution the construction exercise. 

2. The median boy slightly excels the median girl, or somewhat 
more than 50 per cent of the boys excel the median girl, in achievement 
shown in solving each of the originals of the list. This difference is 
not great or even significant, for the individual differences in each sex 
group far outweigh in importance the differences between the medians 
of the groups. 

3. Boys and girls of the truncated groups here considered seem to 
differ no more widely amongst themselves in their ability to do one 
type of exercise than in their ability to do another. 

4. A correlation coefficient was used to relate the geometric abilities 
of the sex groups. The highest correlations between being a boy rather 
than a girl and excelling the average of the total group in the solution 
of an exercise are found in the case of those exercises where the greatest 
differences in preferences are expressed. 

5. In the solutions of the construction exercise with the selected 
groups under consideration, it was found that a higher percentage of 
girls than of boys experienced difficulty in distinguishing between the 
given and required elements of the problem. 

6. The measures of ability to solve geometric originals which were 
made in the study show for the selected groups here considered that 
boys are less variable than girls. 








DISTRIBUTED PRACTICE IN ADDITION 


H. B. REED 
Grinnell College 


It appears that no clear statement can as yet be made upon the 
effect of distributed practice upon improvement. On the one hand, 
the experiments of Starch on substitution, of Pyle on typewriting, of 
Kirby on addition and division, and of Ulrich on white rats apparently 
justify the rule: the more distributed the practice, the more rapid the 
improvement. On the other hand, the experiments of Wimmer on 
reasoning problems, of Thorndike and Hahn on addition, and of Thorn- 
dike on multiplication indicate that relatively long periods are just 
as profitable if not more profitable than short ones. Undoubtedly 
the age of the learners, the quality of the work, and the amount of 
fatigue are conditioning factors. The experiment reported below 
indicates that within certain limits the length of practice may be 
varied at will without any important difference in the amount of 
improvement. 

Two hundred and three first and second year college students 
worked one hour in adding problems of five two-place numbers having 
no zeros. The time was distributed among four groups as follows: 
Group 1, 60 students added one hour continuously; Group 2, 50 students 
added 20 minutes a day for three days; Group 3, 51 students 
added 10 minutes a day for six days; Group 4, 42 students added 10 minutes 
twice a week for three weeks. Time was marked by all groups every 
10 minutes. 

The number of examples attempted and the number added cor- 
rectly were calculated for each 10 minutes of the hour. We may there- 
fore measure the amount of improvement by either the absolute or 
relative amount of gain from the first to the last 10 minutes both in 
the attempts and the rights. The following table gives the average 
for the respective groups: 











                            Attempts                                   Rights
                       First 10   Last 10   Gain       Gain                       Gain       Gain
                       minutes    minutes   absolute   relative   First   Last    absolute   relative
Group 1, 60 minutes    42.9       47        4.1        10.9       40      44.7    4.7        12.2
Group 2, 20 minutes    43.7       58.4      14.7       35.9       40.7    56.4    15.7       43.4
Group 3, 10 minutes    47.2       62.7      15.3       33.1       46.3    59.8    14.9       33.6
Group 4, 10 minutes    45.1       59.7      14.6       28.6       42.3    57.2    14.8       35.1
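
As a rough illustration of the two measures of improvement, the sketch below computes gains from the group averages for attempts copied from the table. The relative gain is taken here as the absolute gain expressed as a percentage of the first-period average. That definition is an assumption of this sketch, since the source does not state it, and the printed figures need not agree exactly with these, for instance if gains were averaged student by student.

```python
# (first 10 minutes, last 10 minutes) average attempts, copied from the table.
attempts = {
    "Group 1": (42.9, 47.0),
    "Group 2": (43.7, 58.4),
    "Group 3": (47.2, 62.7),
    "Group 4": (45.1, 59.7),
}


def gains(first, last):
    """Absolute gain, and gain as a percentage of the first-period score."""
    absolute = last - first
    relative = 100.0 * absolute / first  # assumed definition of relative gain
    return absolute, relative


for group, (first, last) in attempts.items():
    absolute, relative = gains(first, last)
    print(f"{group}: absolute gain {absolute:.1f}, relative gain {relative:.1f} per cent")
```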
It will be seen that the group which worked 20 minutes a day for 
three days made the most improvement but not much more than those 
who worked 10 minutes a day for six days, or those who worked 10 
minutes twice a week for three weeks. The three forms of distributed 
practice were much more profitable than the hour of continuous prac- 
tice. Possibly the last is so unprofitable because of fatigue, but as I 
have found little fatigue from 10 hours of continuous addition with 
such problems, the amount of fatigue from an hour of work must be 
very slight. Woodworth explains the beneficial effects of distributed 
practice as due to the greater frequency of the periods of stimulated 
nutrition for the neurones. But this also must be slight from such a 
small amount of exercise. Neither factor seems adequate to explain 
the good results from distributed practice in this case. Theoretically, 
however, it seems that the value of the extent of distribution is determined 
on the one hand by the amount of forgetting and the after effects 
of stimulated nutrition, and on the other by the amount of fatigue and 
the failure to get the benefits of stimulated nutrition in a repeated 
exercise. Within these limits there is a wide range within which one 
length of practice is just as profitable as another. 
TWO NOTES ON STATISTICAL METHOD 
RAYMOND FRANZEN 


University of California 


I 


In using crude score formulas of r two important advantages besides 
avoiding deviations need consideration: (1) Having the M’s and σ’s 
without further work and (2) having a formula which is easily capable 
of job analysis. The following formula has these two advantages. 
By a use of the usual tables for squares and multiplication, ordinary 
clerical workers can use it under direction. 


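X = obtained measures 
Y = obtained measures 
N = number of pairs of measures 

\[ x = X - M_x, \qquad y = Y - M_y \]

\[ \Sigma xy = \Sigma (X - M_x)(Y - M_y) \]

\[ r = \frac{\Sigma xy}{(\Sigma x^2)^{1/2} (\Sigma y^2)^{1/2}} = \frac{\Sigma (X - M_x)(Y - M_y)}{[\Sigma (X - M_x)^2]^{1/2} [\Sigma (Y - M_y)^2]^{1/2}} = \frac{\dfrac{\Sigma XY}{N} - M_x M_y}{\left(\dfrac{\Sigma X^2}{N} - M_x^2\right)^{1/2} \left(\dfrac{\Sigma Y^2}{N} - M_y^2\right)^{1/2}} \]

In the formula: 

\[ \sigma_x = \left(\frac{\Sigma X^2}{N} - M_x^2\right)^{1/2}, \qquad \sigma_y = \left(\frac{\Sigma Y^2}{N} - M_y^2\right)^{1/2} \]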


Actual computation of many r’s where the quantities vary between 
0 and 100, or may easily be made to do so by the subtraction of a constant, 
has resulted in the choice of the above formula because no extra 
computation or graphing is necessary to get r, both σ’s, and both M’s. 
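
A minimal sketch of the crude-score computation, assuming N pairs of obtained measures: only the sums of X, Y, X², Y² and XY are needed, and both M's and both σ's fall out on the way to r. The function name and the sample measures are hypothetical.

```python
from math import sqrt


def crude_score_r(xs, ys):
    """Return (r, M_x, M_y, sigma_x, sigma_y) from raw scores, without deviations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m_x, m_y = sum_x / n, sum_y / n
    sigma_x = sqrt(sum_xx / n - m_x ** 2)
    sigma_y = sqrt(sum_yy / n - m_y ** 2)
    r = (sum_xy / n - m_x * m_y) / (sigma_x * sigma_y)
    return r, m_x, m_y, sigma_x, sigma_y


# Hypothetical paired measures, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, m_x, m_y, sigma_x, sigma_y = crude_score_r(xs, ys)
print(f"r = {r:.3f}, M_x = {m_x}, M_y = {m_y}")
```

Each sum is a single column total, which is what makes the formula easy to divide among clerical workers using tables of squares and products.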





II 


It is quite common in practice to state the conclusion of an 
experiment as the difference of two means. It is necessary then to 
know the reliability of this difference. The σ of a difference is in part 
a function of the r of the terms of the difference. Some confusion 
results when the terms of this difference are M’s and we know only 
the r of the original measures. We may show that the σ of the difference 
of two M’s may be computed with only this information. 
The σ of a difference may easily be computed. 
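\[ \sigma_{A-B} = \left[\frac{\Sigma (A - B)^2}{n} - M_{A-B}^2\right]^{1/2} = \left(\sigma_A^2 + \sigma_B^2 - 2 r_{AB}\,\sigma_A \sigma_B\right)^{1/2} \tag{1} \]

The difference of two averages is equal to the average of the 
differences of the original measures. 

\[ M_A - M_B = \frac{\Sigma (A - B)}{n} \]

\[ \sigma \text{ of } (M_A - M_B) = \sigma \text{ of } \frac{\Sigma (A - B)}{n} \]

By the formula for the σ of an average: 

\[ \sigma \text{ of } \frac{\Sigma (A - B)}{n} = \frac{\sigma_{A-B}}{n^{1/2}} \tag{2} \]

Substituting (1) in (2) 

\[ \sigma_{M_A - M_B} = \frac{\left(\sigma_A^2 + \sigma_B^2 - 2 r_{AB}\,\sigma_A \sigma_B\right)^{1/2}}{n^{1/2}} \tag{3} \]

\[ = \left(\frac{\sigma_A^2}{n} + \frac{\sigma_B^2}{n} - 2 r_{AB}\,\frac{\sigma_A}{n^{1/2}}\,\frac{\sigma_B}{n^{1/2}}\right)^{1/2} \]

\[ = \left(\sigma_{M_A}^2 + \sigma_{M_B}^2 - 2 r_{AB}\,\sigma_{M_A} \sigma_{M_B}\right)^{1/2} \]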


Then when we desire to know the σ of the difference of averages, 
we may use the σ’s of these averages and the correlation of the original 
measures to substitute in the formula for the σ of a difference. 
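
The result can be sketched numerically. The function names and the values of σ, r and n below are hypothetical, chosen only to show that substituting the σ's of the averages gives the same figure as dividing the σ of the difference by the square root of n.

```python
from math import sqrt


def sigma_of_mean(sigma, n):
    """Sigma of an average of n measures."""
    return sigma / sqrt(n)


def sigma_of_mean_difference(sigma_a, sigma_b, r_ab, n):
    """Sigma of M_A - M_B from the sigmas of the two averages and r of the measures."""
    s_ma = sigma_of_mean(sigma_a, n)
    s_mb = sigma_of_mean(sigma_b, n)
    return sqrt(s_ma ** 2 + s_mb ** 2 - 2 * r_ab * s_ma * s_mb)


# Hypothetical values, for illustration only.
sigma_a, sigma_b, r_ab, n = 10.0, 12.0, 0.5, 100

via_means = sigma_of_mean_difference(sigma_a, sigma_b, r_ab, n)
# Same quantity obtained from the sigma of the difference of the original measures.
via_measures = sqrt(sigma_a ** 2 + sigma_b ** 2 - 2 * r_ab * sigma_a * sigma_b) / sqrt(n)
print(f"{via_means:.4f} {via_measures:.4f}")
```

When r is zero the expression reduces to the familiar square root of the sum of the squared σ's of the two averages; a positive r makes the difference more reliable than that.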
NOTES ON ARTICLES IN EDUCATIONAL 
PSYCHOLOGY IN CURRENT ISSUES OF 
OTHER MAGAZINES 


REPORTED BY CECILE COLLOTON 
Department of Educational Psychology, The Lincoln School of Teachers College 











INTELLIGENCE TESTS 


The Nature of General Intelligence and Ability. Godfrey H. Thomson. The 
British Journal of Psychology, 1924, January, 229-235. The positive correlations 
between a man’s ability and his performance in general are due to a variety of 
factors—not to the possession of a “general ability.” 

The Nature of General Intelligence and Ability. L. L. Thurstone. The British 
Journal of Psychology, 1924, January. States the main characteristics and intrin- 
sic nature of intelligent action as contrasted with unintelligent action. 

The Diagnostic Findings from Seven Years of Examining in the Same School 
Clinic. J. E. W. Wallin. The Journal of Delinquency, 1923, May-July, 169-195. 
Two thousand seven hundred seventy-four consecutive school cases individually 
examined at the St. Louis Psycho-Educational Clinic are reported. Special 
types are discussed in detail. Nineteen tables present statistical data. 

Who Shall Go to College? William Orville Allen. School and Society, 1924, 
Feb. 23, 230-232. Reports a trial survey of the intelligence of public high school 
seniors in 19 selected schools in Pennsylvania. Defines a good college risk 
in terms of an intelligence test score. 

The Intelligence of Women Graduates of Colleges of Liberal Arts Entering the 
Teaching Profession. Agnes L. Rogers. School and Society, 1924, Feb. 16, 201- 
202. Results from the Thorndike Intelligence Examination show that women of 
Goucher College choosing teaching as a profession represent a random sample of 
college graduates in intellectual capacity. 

The Application of the Pintner Group Test to Misdemeanants. Grace Hamill. 
The Journal of Delinquency, 1923, May-July, 158-167. Reports the results of 
testing 2000 misdemeanants over a period of seven months. The group in general 
is inferior with native whites ranking highest, negroes next and foreign whites 
lowest. 

Freshman Academic Achievement in College of Students Presenting Four Years of 
Latin and Those Presenting No Latin. Andrew H. MacPhail. School and Society, 
1924, March 1, 261-262. A comparison of 27 pairs of freshman students at Brown 
College equated on the basis of intelligence test scores shows practically no difference 
in academic achievement between the Latin and No-Latin group. 

A Study of Some Factors Causing a Disparity between Intelligence and Scholarship 
in College Students. Donald A. Laird. School and Society, 1924, March 8, 
290-292. A study of 10 factors such as indolence, ‘dates,’ outside work, etc., 
indicates sources of disparity between results of intelligence tests and college grades. 
Mental Test Scores and Self-regard. Norman Fenton. Educational Administration 
and Supervision, 1924, February, 103-107. Knowledge of test scores 
frequently proves discouraging to students in normal schools and affects their 
interest and effort. 

Some Notes on the Mental Status of the Left-handed. Kate Gordon. The 
Journal of Delinquency, 1923, May-July, 154-157. The 7 per cent of left-handed 
children found among 1019 dependent children are not inferior in mental level to 
the right-handed. 


EDUCATIONAL TESTS 


Research Tests in United States History. Olivia C. Penell. The Historical 
Outlook, 1924, March, 128-143. Reprints of history tests for Grades VI, VII, and 
VIII used by the Department of Educational Investigation and Measurement of 
the School Committee of Boston. 

A Study of the Progress of Newsboys in School. Charles S. Meek. Elementary 
School Journal, 1924, February, 430-433. Test results in reading and arithmetic 
show that selling and distributing papers do not hinder boys’ progress in school. 

An Analysis of Pupils’ Errors in Fractions. R. L. Morton. Journal of Educational 
Research, 1924, February, 117-125. Diagnostic tests show individual 
difficulties and make the use of practice exercises more effective. Statistical 
details of the study are given. 

Making a Diagnostic and Cumulative Survey of School Achievement. M. J. Van 
Wagenen. Educational Administration and Supervision, 1924, February, 79-93. 
Outlines a procedure for conducting a comprehensive survey of school achievement 
emphasizing certain features upon which the success and value of the survey 
depend. 

The Modern Test. Benjamin B. James. School and Society, 1924, Feb. 23, 
208-213. Argues for a frequent use of completion, true-false, and other such 
types of short tests instead of the lengthy mid-term or final examination. Dis- 
cusses advantages to both student and teacher. 

Reliability of the True-false Form of Examination. William H. Batson. Edu- 
cational Administration and Supervision, 1924, February, 95-102. A comparison 
of the true-false and the essay examination with a list of advantages in favor of 
the true-false type. 


MISCELLANEOUS 


A Restatement of Important Educational Conceptions of Dewey in the Terminol- 
ogy of Thorndike. Laura M. and Clara F. Chassell. The Journal of Educational 
Method, 1924, March, 286-298. Quotations from Dewey’s “Democracy and 
Education” and “Interest and Effort in Education” are restated in parallel 
columns in Thorndike’s terminology. 

The General and Specific Effects of Training in Reading with Observations on the 
Experimental Technique. Arthur I. Gates and Dorothy Van Alstyne. Teachers 
College Record, 1924, March, 98-123. Describes important types of reading ability 
and reports an experiment to determine transfer from one type of reading to others. 
Transfer small; training must be specific. 

Arrested for Speeding: A Hundred Million Americans. Garry C. Myers. 
The Journal of Educational Method, 1924, March, 299-302. Speed in learning 
results in error. Emphasis upon accuracy in earlier stages of learning results in
greater speed later. 

Some Extra-intellectual Factors in Delinquency. Franklin S. Fearing. The 
Journal of Delinquency, 1923, May-July, 145-153. Classifies the psychopathic
delinquent into personality types of (1) the emotional, (2) the inadequate, (3) 
the egocentric, and (4) the paranoid. Discusses the IQ distribution and social 
environment of the groups. 

Intelligence in Motor Learning. M. Gopalaswami. The British Journal of
Psychology, 1924, January, 274-290. Describes an attempt to discover the general 
nature of motor learning in mirror drawing. Infers the mental processes of children 
by observation during the tests. Introspective reports by two psychologists 
confirm the observations. 

Is the Pedagogically Accelerated Student a Misfit in the Senior High School? 
Margaret M. Alltucker. The School Review, 1924, March, 193-202. Reports a 
study of 135 cases accelerated from nine months to two years and nine months in 
high school. Each case was studied individually. Mental, pedagogical, physical,
and social ages should all be considered in rapid advancement of pupils.

An Experiment in Sectioning Freshman English. Jennie M. Constance and 
Joseph V. Hanna. Educational Review, 1924, March, 150-153. Dividing 
Freshman classes in English on the basis of psychological examinations and diag- 
nostic tests in English is very satisfactory to both students and instructors. 

Size of Class and School Efficiency. R. E. Tope, Emma Groom, and Marvin F. 
Beeson. Journal of Educational Research, 1924, February, 126-132. Compared
by means of test scores, the pupils in an eleventh-grade English class of 34 pupils
did better work than those in classes of 20 and 44 pupils. Evidence not conclusive.

A Questionary Study of Certain National Differences in Emotional Traits. 
Margaret Floy Washburn. The Journal of Comparative Psychology, 1923, 
December, 413-430. Responses of Russian Jewish, Southern Italian and Polish 
women to questions as to their greatest source of pleasure, fright, and anger show 
fairly well-marked differences between the national groups. 

The Selection of Tasks of Equal Difficulty by a Consensus of Opinion. Edward L. 
Thorndike, Elsie O. Bregman, and Margaret V. Cobb. Journal of Educational 
Research, 1924, February, 133-139. The correlation between the difficulty of 
100 tasks as rated by each one of 40 teachers, or graduate students of psychology, 
and as determined by the percentage of failures in a ninth grade group is about .92. 

Proposals for Standardizing Measurement in Education: III. The Progress
Index. Harry S. Will. Journal of Educational Research, 1924, February, 140-146.
Proposes a measure for comparing the results of instruction in different
school communities which takes into account not only total attainment but also 
the rate at which attainment is achieved. 

An Experiment in a Rural School. Fannie W. Dunn and Marcia Everett. 
Teachers College Record, 1924, March, 144-155. Describes a one-teacher school 
in which means and materials are adapted to conditions with noteworthy results. 
REAL SOCIAL PSYCHOLOGY


Social Psychology, by F. H. Allport. New York: Houghton Mifflin
Company, 1924. Pp. XIV + 453. 


Real satisfaction results when the content of a book carries out 
the implications of the title and the preface. Satisfaction is increased 
when the reader finds an author who not only knows what others have 
done and thought, but proceeds to a fresh organization of ideas from a 
viewpoint for which he clears the way by a sane, impersonal and 
constructive critique. This particular reader was first impressed by 
the preface. The author tells why his organization is different. The 
usual opening chapters of books on social psychology are omitted. The
last chapter in this book accomplishes what a number of opening chap- 
ters usually purport to do. 

There is no padded discussion and presentation of distribution 
and classifications of social groups. The viewpoint is psychological 
and the social behavior of individuals, their responses to various 
types of social stimulation, and their adjustment to social sanctions 
and social standards are analyzed with reference to the effect on 
the individuals who, as component factors, make up the total situation. 
The author has made an exceedingly valuable contribution in provid- 
ing a fine critique of certain Freudian concepts, and presenting alter- 
natives which are in line with a more objective treatment of such 
aspects of behavior. Such alternatives are frankly presented as 
hypothetical but in harmony with known facts although they still 
require experimental evaluation. Numerous experiments are reported 
and others are proposed. 

The author is no doubt a good teacher. He presents controversial 
material in a manner which makes for open-minded consideration 
and clarification of issues, and gives the reader facts and bases for
decisions and generalizations before he overwhelms him with arguments
and abstractions.

The book should prove of genuine practical value to serious
educational and social workers. The reviewer’s copy is thoroughly
marked for reference in the solution of a number of problems.

1 Unsigned reviews are prepared by L. Zirbes.





PREDICTION—DIFFERENCES—GROUPING 


A Critical Study of Certain Measures of Mental Ability and School
Performance, by Inez May Neterer. Baltimore: Warwick &
York, Inc., 1923. Pp. 141.


This monograph reports the attempt to find out from various 
measures applied to 329 pupils in the 4B grades of eight Seattle 
schools in 1919-1920, of what value such measures are for predicting 
school success, for revealing individual differences, and for classifying 
more homogeneously. The measures used were the Stanford 
Revision of the Binet Scale, the Otis Intelligence Scale, the Woody 
Arithmetic Scale, the Courtis Arithmetic Tests, the Starch Arithmetic
Scale A, the Monroe Silent Reading Test, teachers’ estimates,
and teachers’ marks. 

There are no happy conclusions. A profusion of insignificant 
correlations shows the unreliability of any of these measures for predicting
score in any of the others. The startlingly wide ranges
of ability, both for individuals and for classes, show how lacking in 
homogeneity these 4B pupils are, even after misfits have been taken 
out. The persistence of scarcely less wide ranges under the other 
measures, of pupils selected on the basis of a narrow range in one measure,
shows the worthlessness of any of these measures for reclassifying
pupils. 

The study is painstaking, authentically reported, excellently 
bibliographed, and meticulously footnoted. Its weaknesses arise 
from what are still, in this field, excusable inabilities to tell just what 
each of these measures measures, to establish just what is a teachable 
homogeneity of pupils, and to corral and isolate certain vagrant forces 
which probably after all are doing most of the mischief. 


M. H. WILLING.